Abstract


As several industries are moving towards modeling massive 3D virtual worlds, 
the need for content creation tools that can scale in terms of the quantity, quality, 
and diversity of 3D content is becoming evident. In our work, we aim to train 
performant 3D generative models that synthesize textured meshes that can be 
directly consumed by 3D rendering engines, thus immediately usable in downstream
applications. Prior works on 3D generative modeling either lack geometric
details, are limited in the mesh topology they can produce, typically do not support 
textures, or utilize neural renderers in the synthesis process, which makes their 
use in common 3D software non-trivial. In this work, we introduce GET3D, 
a Generative model that directly generates Explicit Textured 3D meshes with 
complex topology, rich geometric details, and high-fidelity textures. We bridge
recent successes in differentiable surface modeling, differentiable rendering, and
2D Generative Adversarial Networks to train our model from 2D image
collections. GET3D is able to generate high-quality 3D textured meshes, ranging
from cars, chairs, animals, motorbikes, and human characters to buildings,
achieving significant improvements over previous methods. Our project page: 
https://nv-tlabs.github.io/GET3D 


1 Introduction 


Diverse, high-quality 3D content is becoming increasingly important for several industries, including 
gaming, robotics, architecture, and social platforms. However, manual creation of 3D assets is very 
time-consuming and requires specific technical knowledge as well as artistic modeling skills. One 
of the main challenges is thus scale — while one can find 3D models on 3D marketplaces such as 
Turbosquid [4] or Sketchfab [3], creating many 3D models to, say, populate a game or a movie with a 
crowd of characters that all look different still takes a significant amount of artist time. 


To facilitate the content creation process and make it accessible to a variety of (novice) users, 
generative 3D networks that can produce high-quality and diverse 3D assets have recently become 
an active area of research [5, 14, 43, 46, 53, 68, 75, 60, 59, 69, 23]. However, to be practically 
useful for current real-world applications, 3D generative models should ideally fulfill the following 
requirements: (a) They should have the capacity to generate shapes with detailed geometry and 
arbitrary topology, (b) The output should be a textured mesh, which is a primary representation used 
by standard graphics software packages such as Blender [15] and Maya [1], and (c) We should be 
able to leverage 2D images for supervision, as they are more widely available than explicit 3D shapes. 


Prior work on 3D generative modeling has focused on subsets of the above requirements, but no 
method to date fulfills all of them (Tab. 1). For example, methods that generate 3D point clouds [5,
68, 75] typically do not produce textures and have to be converted to a mesh in post-processing.
Methods generating voxels often lack geometric details and do not produce texture [66, 20, 27, 40].
Generative models based on neural fields [43, 14] focus on extracting geometry but disregard texture.
Most of these also require explicit 3D supervision. Finally, methods that directly output textured
3D meshes [54, 53] typically require pre-defined shape templates and cannot generate shapes with
complex topology and variable genus.


Method           Application     Representation   Supervision   Textured mesh   Arbitrary topology
OccNet [43]      3D generation   Implicit         3D            ✗               ✓
PointFlow [68]   3D generation   Point cloud      3D            ✗               ✓
Texture3D [53]   3D generation   Mesh             2D            ✓               ✗
StyleNerf [25]   3D-aware NV     Neural field     2D            ✗               ✓
EG3D [8]         3D-aware NV     Neural field     2D            ✗               ✓
PiGAN [7]        3D-aware NV     Neural field     2D            ✗               ✓
GRAF [57]        3D-aware NV     Neural field     2D            ✗               ✓
Ours             3D generation   Mesh             2D            ✓               ✓


Table 1: Comparison with prior works. (NV: Novel view synthesis.)


Recently, rapid progress in neural volume rendering [45] and 2D Generative Adversarial Networks 
(GANs) [34, 35, 33, 29, 52] has led to the rise of 3D-aware image synthesis [7, 57, 8, 49, 51, 25].
However, this line of work aims to synthesize multi-view consistent images using neural rendering in 
the synthesis process and does not guarantee that meaningful 3D shapes can be generated. While a 
mesh can potentially be obtained from the underlying neural field representation using the marching
cubes algorithm [39], extracting the corresponding texture is non-trivial.


In this work, we introduce a novel approach that aims to tackle all the requirements of a practically 
useful 3D generative model. Specifically, we propose GET3D, a Generative model for 3D shapes 
that directly outputs Explicit Textured 3D meshes with high geometric and texture detail and arbitrary 
mesh topology. At the heart of our approach is a generative process that utilizes a differentiable
explicit surface extraction method [60] and a differentiable rendering technique [47, 37]. The former 
enables us to directly optimize and output textured 3D meshes with arbitrary topology, while the latter 
allows us to train our model with 2D images, thus leveraging powerful and mature discriminators 
developed for 2D image synthesis. Since our model directly generates meshes and uses a highly 
efficient (differentiable) graphics renderer, we can easily scale up our model to train with image 
resolution as high as 1024 x 1024, allowing us to learn high-quality geometric and texture details. 


We demonstrate state-of-the-art performance for unconditional 3D shape generation on multiple 
categories with complex geometry from ShapeNet [9], Turbosquid [4] and Renderpeople [2], such as 
chairs, motorbikes, cars, human characters, and buildings. With explicit mesh as output representation, 
GET3D is also very flexible and can easily be adapted to other tasks, including: (a) learning to 
generate decomposed material and view-dependent lighting effects using advanced differentiable 
rendering [12], without supervision, (b) text-guided 3D shape generation using CLIP [56] embedding. 


2 Related Work 


We review recent advances in 3D generative models for geometry and appearance, as well as 3D-aware 
generative image synthesis. 


3D Generative Models In recent years, 2D generative models have achieved photorealistic quality 
in high-resolution image synthesis [34, 35, 33, 52, 29, 19, 16]. This progress has also inspired 
research in 3D content generation. Early approaches aimed to directly extend the 2D CNN generators 
to 3D voxel grids [66, 20, 27, 40, 62], but the high memory footprint and computational complexity of 
3D convolutions hinder the generation process at high resolution. As an alternative, other works have 
explored point cloud [5, 68, 75, 46], implicit [43, 14], or octree [30] representations. However, these 
works focus mainly on generating geometry and disregard appearance. Their output representations 
also need to be post-processed to make them compatible with standard graphics engines. 


More similar to our work, Textured3DGAN [54, 53] and DIBR [11] generate textured 3D meshes,
but they formulate the generation as a deformation of a template mesh, which prevents them from
generating complex topology or shapes with varying genus; our method has no such restriction. PolyGen [48]
and SurfGen [41] can produce meshes with arbitrary topology, but do not synthesize textures.
Figure 1: We export our generated shapes and visualize them in Blender. GET3D is able to generate
diverse shapes with arbitrary topology, high quality geometry, and texture. 


3D-Aware Generative Image Synthesis Inspired by the success of neural volume rendering [45] 
and implicit representations [43, 14], recent work started tackling the problem of 3D-aware image 
synthesis [7, 57, 49, 26, 25, 76, 8, 51, 58, 67]. However, neural volume rendering networks are 
typically slow to query, leading to long training times [7, 57], and generate images of limited 
resolution. GIRAFFE [49] and StyleNerf [25] improve the training and rendering efficiency by 
performing neural rendering at a lower resolution and then upsampling the results with a 2D CNN. 
However, the performance gain comes at the cost of a reduced multi-view consistency. By utilizing a 
dual discriminator, EG3D [8] can partially mitigate this problem. Nevertheless, extracting a textured 
surface from methods that are based on neural rendering is a non-trivial endeavor. In contrast, 
GET3D directly outputs textured 3D meshes that can be readily used in standard graphics engines. 


3 Method 


We now present our GET3D framework for synthesizing textured 3D shapes. Our generation process 
is split into two parts: a geometry branch, which differentiably outputs a surface mesh of arbitrary 
topology, and a texture branch that produces a texture field that can be queried at the surface points to 
produce colors. The latter can be extended to other surface properties, such as materials
(Sec. 4.3.1). During training, an efficient differentiable rasterizer is utilized to render the resulting 
textured mesh into 2D high-resolution images. The entire process is differentiable, allowing for 
adversarial training from images (with masks indicating an object of interest) by propagating the 
gradients from the 2D discriminator to both generator branches. Our model is illustrated in Fig. 2. In 
the following, we first introduce our 3D generator in Sec 3.1, before proceeding to the differentiable 
rendering and loss functions in Sec 3.2. 


3.1 Generative Model of 3D Textured Meshes 


We aim to learn a 3D generator $M, E = G(\mathbf{z})$ that maps a sample from a Gaussian distribution
$\mathbf{z} \in \mathcal{N}(0, \mathbf{I})$ to a mesh $M$ with texture $E$.


Since the same geometry can have different textures, and the same texture can be applied to different 
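geometries, we sample two random input vectors $\mathbf{z}_1 \in \mathbb{R}^{512}$ and $\mathbf{z}_2 \in \mathbb{R}^{512}$. Following StyleGAN [34,
35, 33], we then use non-linear mapping networks $f_{\text{geo}}$ and $f_{\text{tex}}$ to map $\mathbf{z}_1$ and $\mathbf{z}_2$ to intermediate
latent vectors $\mathbf{w}_1 = f_{\text{geo}}(\mathbf{z}_1)$ and $\mathbf{w}_2 = f_{\text{tex}}(\mathbf{z}_2)$, which are further used to produce styles that control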
the generation of 3D shapes and texture, respectively. We formally introduce the generator for 
geometry in Sec. 3.1.1 and the texture generator in Sec. 3.1.2. 
Figure 2: Overview of GET3D: We generate a 3D SDF and a texture field via two latent codes. We
utilize DMTet [60] to extract a 3D surface mesh from the SDF, and query the texture field at surface
points to get colors. We train with adversarial losses defined on 2D images. In particular, we use
a rasterization-based differentiable renderer [37] to obtain RGB images and silhouettes. We utilize
two 2D discriminators, one on RGB images and one on silhouettes, to classify whether the
inputs are real or fake. The whole model is end-to-end trainable. Note that we additionally provide
an improved version of our Generator in Appendix A.5 and Fig. C.


3.1.1 Geometry Generator 


We design our geometry generator to incorporate DMTet [60], a recently proposed differentiable 
surface representation. DMTet represents geometry as a signed distance field (SDF) defined on a 
deformable tetrahedral grid [22, 24], from which the surface can be differentiably recovered through 
marching tetrahedra [17]. Deforming the grid by moving its vertices results in a better utilization 
of its resolution. By adopting DMTet for surface extraction, we can produce explicit meshes with 
arbitrary topology and genus. We next provide a brief summary of DMTet and refer the reader to the 
original paper for further details. 


Let $(V_T, T)$ denote the full 3D space that the object lies in, where $V_T$ are the vertices in the tetrahedral
grid $T$. Each tetrahedron $T_k \in T$ is defined using four vertices $\{v_{a_k}, v_{b_k}, v_{c_k}, v_{d_k}\}$, with $k \in
\{1, \dots, K\}$, where $K$ is the total number of tetrahedra, and $v_{i_k} \in V_T$, $v_{i_k} \in \mathbb{R}^3$. In addition to its
3D coordinates, each vertex $v_i$ contains the SDF value $s_i \in \mathbb{R}$ and the deformation $\Delta v_i \in \mathbb{R}^3$ of the
vertex from its initial canonical coordinate. This representation allows recovering the explicit mesh
through differentiable marching tetrahedra [60], where SDF values in continuous space are computed
by a barycentric interpolation of their value $s_i$ on the deformed vertices $v'_i = v_i + \Delta v_i$.


Network Architecture We map $\mathbf{w}_1 \in \mathbb{R}^{512}$ to SDF values and deformations at each vertex $v_i$
through a series of conditional 3D convolutional and fully connected layers. Specifically, we first use
3D convolutional layers to generate a feature volume conditioned on $\mathbf{w}_1$. We then query the feature
at each vertex $v_i \in V_T$ using trilinear interpolation and feed it into MLPs that output the SDF value
$s_i$ and the deformation $\Delta v_i$. In cases where modeling at a high resolution is required (e.g., motorbikes
with thin structures in the wheels), we further use volume subdivision following [60].


Differentiable Mesh Extraction After obtaining $s_i$ and $\Delta v_i$ for all the vertices, we use the
differentiable marching tetrahedra algorithm to extract the explicit mesh. Marching tetrahedra
determines the surface topology within each tetrahedron based on the signs of $s_i$. In particular,
a mesh face is extracted when $\text{sign}(s_i) \neq \text{sign}(s_j)$, where $i, j$ denote the indices of the vertices on
an edge of the tetrahedron, and the vertices $m_{i,j}$ of that face are determined by a linear interpolation
as $m_{i,j} = \frac{v'_i s_j - v'_j s_i}{s_j - s_i}$. Note that the above equation is only evaluated when $s_i \neq s_j$, thus it is
differentiable, and the gradient from $m_{i,j}$ can be back-propagated into the SDF values $s_i$ and
deformations $\Delta v_i$. With this representation, shapes with arbitrary topology can easily be
generated by predicting different signs of $s_i$.
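

The vertex placement step of marching tetrahedra can be illustrated with a short NumPy sketch. It is not the DMTet implementation: the function name `edge_zero_crossings` and the toy tetrahedron are hypothetical, and the sketch assumes the standard linear interpolation to the zero crossing of the SDF along each sign-changing edge. It only computes the surface vertices and leaves out the per-tetrahedron face lookup.

```python
import numpy as np

def edge_zero_crossings(verts, sdf, edges):
    """Return surface points on edges whose endpoint SDF signs differ.

    verts: (V, 3) deformed vertex positions
    sdf:   (V,)   signed distance value per vertex
    edges: (E, 2) vertex index pairs (i, j)
    """
    i, j = edges[:, 0], edges[:, 1]
    crossing = np.sign(sdf[i]) != np.sign(sdf[j])  # only these edges carry a surface vertex
    i, j = i[crossing], j[crossing]
    s_i, s_j = sdf[i][:, None], sdf[j][:, None]
    # Linear interpolation to the point where the interpolated SDF is zero.
    # s_i != s_j holds on every selected edge, so the division is safe.
    return (verts[i] * s_j - verts[j] * s_i) / (s_j - s_i)

# Toy tetrahedron: vertex 0 is inside (negative SDF), the other three are outside.
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
sdf = np.array([-1.0, 1.0, 1.0, 3.0])
edges = np.array([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]])
points = edge_zero_crossings(verts, sdf, edges)
print(points)  # three points, one per sign-changing edge
```

Because the output is a smooth function of both the SDF values and the vertex positions on every selected edge, an automatic differentiation framework could propagate gradients through the same expression.
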
3.1.2 Texture Generator


Directly generating a texture map consistent with the output mesh is not trivial, as the generated shape
can have an arbitrary genus and topology. We thus parameterize the texture as a texture field [50]. 


Specifically, we model the texture field with a function $f_t$ that maps the 3D location of a surface point
$\mathbf{p} \in \mathbb{R}^3$, conditioned on $\mathbf{w}_2$, to the RGB color $\mathbf{c} \in \mathbb{R}^3$ at that location. Since the texture field
depends on geometry, we additionally condition this mapping on the geometry latent code $\mathbf{w}_1$, such
that $\mathbf{c} = f_t(\mathbf{p}, \mathbf{w}_1 \oplus \mathbf{w}_2)$, where $\oplus$ denotes concatenation.


Network Architecture We represent our texture field using a tri-plane representation, which is
efficient and expressive in reconstructing 3D objects [55] and generating 3D-aware images [8].
Specifically, we follow [8, 35] and use a conditional 2D convolutional neural network to map the
latent code $\mathbf{w}_1 \oplus \mathbf{w}_2$ to three axis-aligned orthogonal feature planes of size $N \times N \times (C \times 3)$, where
$N = 256$ denotes the spatial resolution and $C = 32$ the number of channels.


Given the feature planes, the feature vector $\mathbf{f}^t \in \mathbb{R}^{32}$ of a surface point $\mathbf{p}$ can be recovered as
$\mathbf{f}^t = \sum_e \rho(\pi_e(\mathbf{p}))$, where $\pi_e(\mathbf{p})$ is the projection of the point $\mathbf{p}$ to the feature plane $e$ and $\rho(\cdot)$
denotes bilinear interpolation of the features. An additional fully connected layer is then used to
map the aggregated feature vector $\mathbf{f}^t$ to the RGB color $\mathbf{c}$. Note that, different from other works on
3D-aware image synthesis [8, 25, 7, 57] that also use a neural field representation, we only need to
sample the texture field at the locations of the surface points (as opposed to dense samples along a
ray). This greatly reduces the computational complexity of rendering high-resolution images and
guarantees multi-view consistent images by construction.
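

A minimal NumPy sketch of the tri-plane lookup is given here for illustration: each surface point is projected onto three axis-aligned planes, the features are bilinearly interpolated, and the three results are summed. The function names, the choice of the xy, xz and yz planes, the assumption that points lie in [-1, 1]^3, and the small plane size used in the demo are all assumptions made for this sketch, not details taken from the released model.

```python
import numpy as np

def bilinear(plane, uv):
    """Bilinearly sample plane (N, N, C) at continuous coords uv (P, 2) in [-1, 1]."""
    n = plane.shape[0]
    xy = (uv + 1.0) * 0.5 * (n - 1)                 # map [-1, 1] to pixel coordinates
    x0 = np.clip(np.floor(xy).astype(int), 0, n - 2)
    t = xy - x0                                      # fractional offsets in [0, 1]
    tx, ty = t[:, :1], t[:, 1:]
    ix, iy = x0[:, 0], x0[:, 1]
    return ((1 - tx) * (1 - ty) * plane[ix, iy]
            + tx * (1 - ty) * plane[ix + 1, iy]
            + (1 - tx) * ty * plane[ix, iy + 1]
            + tx * ty * plane[ix + 1, iy + 1])

def triplane_features(planes, p):
    """Sum the interpolated features of points p (P, 3) over three planes."""
    # Assumed plane order: xy, xz, yz.
    projections = (p[:, [0, 1]], p[:, [0, 2]], p[:, [1, 2]])
    return sum(bilinear(plane, uv) for plane, uv in zip(planes, projections))

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 8, 8, 32))              # demo uses N = 8 instead of N = 256
surface_points = rng.uniform(-1, 1, size=(5, 3))
features = triplane_features(planes, surface_points)
print(features.shape)                                # (5, 32)
```

The cost of the lookup grows with the number of surface points only, which is the reason sampling on the surface is cheaper than sampling densely along camera rays.
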


3.2 Differentiable Rendering and Training 


In order to supervise our model during training, we draw inspiration from Nvdiffrec [47] that performs 
multi-view 3D object reconstruction by utilizing a differentiable renderer. Specifically, we render 
the extracted 3D mesh and the texture field into 2D images using a differentiable renderer [37], and 
supervise our network with a 2D discriminator, which tries to distinguish images of real objects
from images rendered from generated objects.


Differentiable Rendering We assume that the camera distribution C that was used to acquire the 
images in the dataset is known. To render the generated shapes, we randomly sample a camera c from 
C, and utilize a highly-optimized differentiable rasterizer Nvdiffrast [37] to render the 3D mesh into a 
2D silhouette as well as an image where each pixel contains the coordinates of the corresponding 3D 
point on the mesh surface. These coordinates are further used to query the texture field to obtain the 
RGB values. Since we operate directly on the extracted mesh, we can render high-resolution images 
with high efficiency, allowing our model to be trained with image resolution as high as 1024 x 1024. 


Discriminator & Objective We train our model using an adversarial objective. We adopt the 
discriminator architecture from StyleGAN [34], and use the same non-saturating GAN objective with 
R1 regularization [42]. We empirically find that using two separate discriminators, one for RGB 
images and another one for silhouettes, yields better results than a single discriminator operating on 
both. Let $D_x$ denote the discriminator, where $x$ can either be an RGB image or a silhouette. The
adversarial objective is then defined as follows:


$$L(D_x, G) = \mathbb{E}_{\mathbf{z} \in \mathcal{N}, c \in \mathcal{C}}\left[g\left(D_x(R(G(\mathbf{z}), c))\right)\right] + \mathbb{E}_{I_x \in p_x}\left[g\left(-D_x(I_x)\right) + \lambda \|\nabla D_x(I_x)\|_2^2\right], \quad (1)$$


where $g(u)$ is defined as $g(u) = -\log(1 + \exp(-u))$, $p_x$ is the distribution of real images, $R$ denotes
rendering, and $\lambda$ is a hyperparameter. Since $R$ is differentiable, the gradients can be backpropagated
from 2D images to our 3D generators.
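

To make the terms of the adversarial objective concrete, the following sketch evaluates them for a toy linear discriminator. It is only an illustration under stated assumptions: the linear discriminator, the function names, and the random stand-in data are hypothetical, and the gradient penalty is taken in its usual squared-norm R1 form. For a linear discriminator the input gradient is the weight vector itself, which keeps the example free of automatic differentiation.

```python
import numpy as np

def g(u):
    """g(u) = -log(1 + exp(-u)), evaluated stably via logaddexp."""
    return -np.logaddexp(0.0, -u)

def objective(w, b, fake, real, lam):
    """Evaluate the two expectation terms for a toy linear D(x) = w.x + b."""
    d_fake = fake @ w + b
    d_real = real @ w + b
    # For a linear D the gradient w.r.t. the input is w for every sample,
    # so the R1 penalty ||grad D||^2 reduces to ||w||^2.
    r1 = np.sum(w ** 2)
    return np.mean(g(d_fake)) + np.mean(g(-d_real) + lam * r1)

rng = np.random.default_rng(0)
fake = rng.normal(size=(16, 4))       # stand-in for rendered images, flattened
real = rng.normal(size=(16, 4)) + 1.0 # stand-in for real images, flattened
value = objective(np.full(4, 0.1), 0.0, fake, real, lam=10.0)
print(value)
```

Writing `g` with `np.logaddexp` avoids overflow for large negative logits, where a direct `np.log(1 + np.exp(-u))` would fail.
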


Regularization To remove internal floating faces that are not visible in any of the views, we further 
regularize the geometry generator with a cross-entropy loss defined between the SDF values of the 
neighboring vertices [47]: 


$$L_{\text{reg}} = \sum_{i,j \in S_e} H\left(\sigma(s_i), \text{sign}(s_j)\right) + H\left(\sigma(s_j), \text{sign}(s_i)\right), \quad (2)$$

where $H$ denotes binary cross-entropy loss and $\sigma$ denotes the sigmoid function. The sum in Eq. 2 is
defined over the set of unique edges $S_e$ in the tetrahedral grid, for which $\text{sign}(s_i) \neq \text{sign}(s_j)$.
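

The regularizer can be sketched in NumPy as follows. The function name `sdf_regularizer` is hypothetical, and the sketch assumes that the sign is encoded as a binary target (1 for a positive SDF value, 0 otherwise) so that it can be used in a binary cross-entropy; the text does not spell out this encoding.

```python
import numpy as np

def sdf_regularizer(sdf, edges, eps=1e-7):
    """Cross-entropy penalty summed over edges whose endpoint SDF signs differ."""
    i, j = edges[:, 0], edges[:, 1]
    mask = np.sign(sdf[i]) != np.sign(sdf[j])      # the edge set with a sign change
    s_i, s_j = sdf[i][mask], sdf[j][mask]

    def bce(logit, target):
        p = np.clip(1.0 / (1.0 + np.exp(-logit)), eps, 1.0 - eps)
        return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

    # Assumption: the sign is encoded as a {0, 1} target (1 for positive SDF).
    t_i, t_j = (s_i > 0).astype(float), (s_j > 0).astype(float)
    return np.sum(bce(s_i, t_j) + bce(s_j, t_i))

sdf = np.array([-1.0, 1.0, 2.0])
edges = np.array([[0, 1], [1, 2]])                 # only edge (0, 1) changes sign
loss = sdf_regularizer(sdf, edges)
print(loss)
```

Every sign-changing edge contributes a positive penalty, and edges without a sign change contribute nothing, so the term discourages surface crossings that the image losses do not support.
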


The overall loss function is then defined as:

$$L = L(D_{\text{rgb}}, G) + L(D_{\text{mask}}, G) + \mu L_{\text{reg}}, \quad (3)$$


where $\mu$ is a hyperparameter that controls the level of regularization.
Category   Method              COV (%, ↑)       MMD (↓)         FID (↓)
                               LFD     CD       LFD    CD       Ori     3D
Car        PointFlow [68]      51.91   57.16    1971   0.82     -       -
           OccNet [43]         27.29   42.63    1717   0.61     -       -
           Pi-GAN [7]          0.82    0.55     6626   25.54    52.82   104.29
           GRAF [57]           1.57    1.57     6012   10.63    49.95   52.85
           EG3D [8]            60.16   49.52    1527   0.72     15.52   21.89
           Ours                66.78   58.39    1491   0.71     10.25   10.25
           Ours+Subdiv.        62.48   55.93    1553   0.72     12.14   12.14
           Ours (improved G)   59.00   47.95    1473   0.81     10.60   10.60
Chair      PointFlow [68]      49.58   71.87    3755   3.03     -       -
           OccNet [43]         61.10   67.13    3494   3.98     -       -
           Pi-GAN [7]          53.76   39.65    4092   6.65     65.70   120.53
           GRAF [57]           50.23   39.28    4055   6.80     43.82   61.63
           EG3D [8]            58.31   50.14    3444   4.72     38.87   46.06
           Ours                69.08   69.91    3167   3.72     23.28   23.28
           Ours+Subdiv.        71.59   70.84    3163   3.95     23.17   23.17
           Ours (improved G)   71.96   71.96    3125   3.96     22.41   22.41
Mbike      PointFlow [68]      50.68   63.01    4023   1.38     -       -
           OccNet [43]         30.14   47.95    4551   2.04     -       -
           Pi-GAN [7]          2.74    6.85     8864   21.08    72.67   131.38
           GRAF [57]           43.84   50.68    4528   2.40     83.20   113.39
           EG3D [8]            38.36   34.25    4199   2.21     66.38   89.97
           Ours                67.12   67.12    3631   1.72     65.60   65.60
           Ours+Subdiv.        63.01   61.64    3440   1.79     54.12   54.12
           Ours (improved G)   69.86   65.75    3393   1.79     48.90   48.90
Animal     PointFlow [68]      42.70   74.16    4885   1.68     -       -
           OccNet [43]         56.18   75.28    4418   2.39     -       -
           Pi-GAN [7]          31.46   30.34    6084   8.37     36.26   150.86
           GRAF [57]           60.67   61.80    5083   4.81     42.07   52.48
           EG3D [8]            74.16   58.43    4889   3.42     40.03   83.47
           Ours                79.77   78.65    3798   2.02     28.33   28.33
           Ours+Subdiv.        66.29   74.16    3864   2.03     28.49   28.49
           Ours (improved G)   74.16   82.02    3767   1.97     27.18   27.18


Table 2: Quantitative evaluation of generation results: ↑: the higher the better, ↓: the lower the
better. The best scores are highlighted in bold. MMD-CD scores are multiplied by $10^3$. The results of
Ours (improved G) were obtained after the review process by improving the design of the generator
network architecture G (see Appendix A.5 for more details).
Figure 3: Qualitative comparison of GET3D to the baseline methods in terms of extracted 3D
geometry. GET3D is able to generate shapes with much higher geometric detail across all categories. 


4 Experiments 


We conduct extensive experiments to evaluate our model. We first compare the quality of the 
3D textured meshes generated by GET3D to the existing methods using the ShapeNet [9] and 
Turbosquid [4] datasets. Next, we ablate our design choices in Sec. 4.2. Finally, we demonstrate the 
flexibility of GET3D by adapting it to downstream applications in Sec. 4.3. Additional experimental 
results and implementation details are provided in the Appendix.


4.1 Experiments on Synthetic Datasets 


Datasets For evaluation on ShapeNet [9], we use three categories with complex geometry — Car, 
Chair, and Motorbike, which contain 7497, 6778, and 337 shapes, respectively. We randomly split 
each category into training (70%), validation (10%), and test (20%), and further remove from the
test set shapes that have duplicates in the training set. To render the training data, we randomly 
sample camera poses from the upper hemisphere of each shape. For the Car and Chair categories, 
we use 24 random views, while for Motorbike we use 100 views due to the smaller number of shapes. As
models in ShapeNet only have simple textures, we also evaluate GET3D on an Animal dataset (442 
shapes) collected from TurboSquid [4], where textures are more detailed and we split it into training, 
validation and test as defined above. Finally, to demonstrate the versatility of GET3D, we also 
provide qualitative results on the House dataset collected from Turbosquid (563 shapes), and Human 
Body dataset from Renderpeople [2] (500 shapes). We train a separate model on each category. 


Figure 4: Qualitative comparison of GET3D to the baseline methods in terms of generated 2D
images. GET3D generates sharp textures with a high level of detail. (Panels: Car, Animal, Chair, Motorbike.)
Figure 5: Shapes generated by GET3D rendered in Blender. GET3D generates high-quality
shapes with diverse texture, high-quality geometry, and complex topology. Zoom in for details.


Baselines We compare GET3D to two groups of works: 1) 3D generative models that rely on 
3D supervision: PointFlow [68] and OccNet [43]. Note that these methods only generate geometry 
without texture. 2) 3D-aware image generation methods: GRAF [57], PiGAN [7], and EG3D [8]. 


Metrics To evaluate the quality of our synthesis, we consider both the geometry and texture of 
the generated shapes. For geometry, we adopt metrics from [5] and use both Chamfer Distance 
(CD) and Light Field Distance [10] (LFD) to compute the Coverage score and Minimum Matching 
Distance. For OccNet [43], GRAF [57], PiGAN [7] and EG3D [8], we use marching cubes to extract 
the underlying geometry. For PointFlow [68], we use Poisson surface reconstruction to convert a 
point cloud into a mesh when evaluating LFD. To evaluate texture quality, we adopt the FID [28] 
metric commonly used to evaluate image synthesis. In particular, for each category, we render the test 
shapes into 2D images, and also render the generated 3D shapes from each model into 50k images 
using the same camera distribution. We then compute FID on the two image sets. As the baselines 
from 3D-aware image synthesis [57, 7, 8] do not directly output textured meshes, we compute FID 
score in two ways: (i) we use their neural volume rendering to obtain 2D images, which we refer to 
as FID-Ori, and (ii) we extract the mesh from their neural field representation using marching cubes, 
render it, and then use the 3D location of each pixel to query the network to obtain the RGB values. 
We refer to this score, which is more aware of the actual 3D shape, as FID-3D. Further details on the
evaluation metrics are available in Appendix B.3.
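

For reference, the two geometry metrics can be sketched in NumPy as below. The sketch follows the usual definitions: Coverage is the fraction of reference shapes that are the nearest neighbour of at least one generated shape, and Minimum Matching Distance averages, over reference shapes, the distance to the closest generated shape. The squared-distance form of the Chamfer distance and the function names are assumptions of this sketch, and the evaluation protocol in Appendix B.3 may differ in details such as point sampling and normalization.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def cov_and_mmd(generated, reference, dist=chamfer):
    """Coverage and Minimum Matching Distance between two lists of shapes."""
    d = np.array([[dist(gen, ref) for ref in reference] for gen in generated])
    matched = np.unique(d.argmin(axis=1))   # reference shape nearest to each generated shape
    cov = len(matched) / len(reference)     # fraction of reference shapes that get matched
    mmd = d.min(axis=0).mean()              # mean distance from each reference to its closest generated shape
    return cov, mmd

# Toy data: both generated shapes coincide with the first reference shape.
a = np.zeros((1, 3))
b = np.array([[10.0, 0.0, 0.0]])
cov, mmd = cov_and_mmd([a, a.copy()], [a, b])
print(cov, mmd)
```

In the toy example only one of the two reference shapes is covered, which is the kind of mode collapse that Coverage is meant to expose, while the unmatched reference shape raises MMD.
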


Experimental Results We provide quantitative results in Table 2 and qualitative examples in Fig. 3
and Fig. 4. Additional results are available in the supplementary video. Compared to OccNet [43] that
uses 3D supervision during training, GET3D achieves better performance in terms of both diversity
(COV) and quality (MMD), and our generated shapes have more geometric details. PointFlow [68]
outperforms GET3D in terms of MMD on CD, while GET3D is better in MMD on LFD. We
hypothesize that this is because PointFlow directly optimizes on point locations, which favours
CD. GET3D also performs favourably when compared to 3D-aware image synthesis methods: we
achieve significant improvements over PiGAN [7] and GRAF [57] in terms of all metrics on all
datasets. Our generated shapes also contain more detailed geometry and texture. Compared with
the recent work EG3D [8], we achieve comparable performance on generating 2D images (FID-Ori),
while we significantly improve on 3D shape synthesis in terms of FID-3D, which demonstrates the
effectiveness of our model on learning actual 3D geometry and texture.


Figure 6: Shape interpolation. We interpolate both geometry and texture latent codes from left to
right.


Figure 7: Shape variation. We locally perturb each latent code to generate different shapes.


Since we synthesize textured meshes, we can export our shapes into Blender¹. We show rendering
results in Fig. 1 and 5. GET3D is able to generate shapes with diverse and high-quality geometry and
topology, very thin structures (motorbikes), as well as complex textures on cars, animals, and houses.


Shape Interpolation GET3D also enables shape interpolation, which can be useful for editing 
purposes. We explore the latent space of GET3D in Fig. 6, where we interpolate the latent codes to 
generate each shape from left to right. GET3D is able to faithfully generate a smooth and meaningful
transition from one shape to another. We further explore the local latent space by slightly perturbing
the latent codes in a random direction. GET3D produces novel and diverse shapes when applying
local editing in the latent space (Fig. 7). 
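

The two latent-space operations can be sketched as follows. The text does not state which interpolation scheme or which latent space is used, so the linear interpolation, the perturbation scale, and the function names here are assumptions made purely for illustration; the generator call itself is omitted.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps):
    """Linearly interpolate between two latent codes; returns (steps, D)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

def perturb_latent(z, scale, rng):
    """Move a latent code by `scale` along a random unit direction."""
    direction = rng.normal(size=z.shape)
    return z + scale * direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=512), rng.normal(size=512)
path = interpolate_latents(z_a, z_b, steps=5)      # would be applied to both geometry and texture codes
neighbour = perturb_latent(z_a, scale=0.1, rng=rng)
print(path.shape)                                  # (5, 512)
```

Each row of `path` would be fed to the generator to produce one shape of the transition.
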


4.2 Ablations 


We ablate our model in two ways: 1) w/ and w/o volume subdivision, 2) training using different 
image resolutions. Further ablations are provided in Appendix C.3.


Ablation of Volume Subdivision As shown in Tbl. 2, volume subdivision significantly improves
the performance on classes with thin structures (e.g., motorbikes), while not yielding gains on other
classes. We hypothesize that the initial tetrahedral resolution is already sufficient to capture the
detailed geometry on Chairs and Cars, and hence the subdivision cannot provide further improvements.


Ablating Different Image Resolutions We ablate the effect of the training image resolution in
Tbl. 3. As expected, increased image resolution improves the performance in terms of FID and shape
quality, as the network can see more details, which are often not available in the low-resolution images.
This corroborates the importance of training with higher image resolutions, which are often hard to
make use of for implicit-based methods.


Class    Img Res   COV (%, ↑)       MMD (↓)        FID (↓)
                   LFD     CD       LFD    CD
Car      …         …       …        …      …       …
         1024²     66.78   58.39    1491   0.71    10.25
Chair    128²      38.25   33.98    3886   5.90    43.04
         512²      68.80   69.92    3149   3.90    30.16
         1024²     69.08   67.87    3167   3.74    23.28
Mbike    512²      68.49   65.75    3421   1.74    74.04
         1024²     67.12   64.38    3631   1.73    65.60
Animal   512²      77.53   78.65    3828   2.01    29.75
         …         …       …        …      …       …


Table 3: Ablating the image resolution. ↑: higher is better, ↓: lower is better.


¹ We use xatlas [71] to get texture coordinates for the extracted mesh, from which we can warp our 3D mesh
onto a 2D plane and obtain the corresponding 3D location on the mesh surface for any position on the 2D plane.
We then discretize the 2D plane into an image, and for each pixel, we query the texture field using the corresponding
3D location to obtain the RGB color to get the texture map.


Figure 8: Material generation and relighting. Despite being unsupervised, our model generates
reasonable material properties, and can be realistically rendered with real-world HDR panoramas
(bottom right). Normals are computed from the generated mesh. Note how specular effects change
under two different lighting conditions.


4.3 Applications 
4.3.1 Material Generation for View-dependent Lighting Effects 


GET3D can easily be extended to also generate surface materials that are directly usable in modern 
graphics engines. In particular, we follow the widely used Disney BRDF [6, 32] and describe the
materials in terms of the base color ($\mathbb{R}^3$), metallic ($\mathbb{R}$), and roughness ($\mathbb{R}$) properties. As a result, we
repurpose our texture generator to now output a 5-channel reflectance field (instead of only RGB).
To accommodate differentiable rendering of materials, we adopt an efficient spherical Gaussian
(SG) based deferred rendering pipeline [12]. Specifically, we rasterize the reflectance field into a
G-buffer, and randomly sample an HDR image from a set of real-world outdoor HDR panoramas
$S_{\text{light}} = \{L_{SG}\}_K$, where $L_{SG} \in \mathbb{R}^{32 \times 7}$ is obtained by fitting 32 SG lobes to each panorama. The SG
renderer [12] then uses the camera c to render an RGB image with view-dependent lighting effects, 
which we feed into the discriminator during training. Note that GET3D does not require material 
supervision during training and learns to generate decomposed materials in an unsupervised manner. 
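

As a small illustration of the lighting representation, the sketch below evaluates the radiance of a spherical Gaussian environment in a given direction, using the standard SG form (amplitude times the exponential of sharpness times the cosine minus one). The per-lobe layout of seven numbers as axis (3), sharpness (1) and RGB amplitude (3) is an assumption of this sketch, as is the function name; the renderer of [12] may store and use the lobes differently.

```python
import numpy as np

def sg_radiance(lobes, direction):
    """Sum of spherical Gaussian lobes evaluated in one direction.

    lobes: (K, 7) array, assumed layout [axis (3), sharpness (1), RGB amplitude (3)]
    direction: (3,) query direction, normalized inside the function
    """
    axis = lobes[:, 0:3] / np.linalg.norm(lobes[:, 0:3], axis=1, keepdims=True)
    sharpness = lobes[:, 3:4]
    amplitude = lobes[:, 4:7]
    d = direction / np.linalg.norm(direction)
    weights = np.exp(sharpness * (axis @ d - 1.0)[:, None])  # equals 1 along the lobe axis
    return np.sum(amplitude * weights, axis=0)

# A single lobe pointing along +z with sharpness 2 and RGB amplitude (1, 2, 3).
lobes = np.array([[0.0, 0.0, 1.0, 2.0, 1.0, 2.0, 3.0]])
print(sg_radiance(lobes, np.array([0.0, 0.0, 1.0])))   # full amplitude along the axis
print(sg_radiance(lobes, np.array([0.0, 0.0, -1.0])))  # attenuated on the opposite side
```

A panorama fitted with 32 such lobes would simply be a `(32, 7)` array passed to the same function.
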


We provide qualitative results of generated surface materials in Fig. 8. Despite being unsupervised, GET3D
discovers interesting material decomposition, e.g., the windows are correctly predicted with a smaller 
roughness value to be more glossy than the car’s body, and the car’s body is discovered as more 
dielectric while the window is more metallic. Generated materials enable us to produce realistic 
relighting results, which can account for complex specular effects under different lighting conditions. 


4.3.2 Text-Guided 3D Synthesis 


Similar to image GANs, GET3D also supports text-guided 3D content synthesis by fine-tuning
a pre-trained model under the guidance of CLIP [56]. Note that our final synthesis result is a
textured 3D mesh. To this end, we follow the dual-generator design from StyleGAN-NADA [21],
where a trainable copy $G_t$ and a frozen copy $G_f$ of the pre-trained generator are adopted. During
optimization, $G_t$ and $G_f$ both render images from 16 random camera views. Given a text query, we
sample 500 pairs of noise vectors $z_1$ and $z_2$. For each sample, we optimize the parameters of $G_t$ to
minimize the directional CLIP loss [21] (the source text labels are “car”, “animal” and “house” for
the corresponding categories), and select the samples with minimal loss. To accelerate this process,
we first run a small number of optimization steps for the 500 samples, then choose the top 50 samples
with the lowest losses, and run the optimization for 300 steps. The results and comparison against a
SOTA text-driven mesh stylization method, Text2Mesh [44], are provided in Fig. 9. Note that Text2Mesh [44]
requires a mesh of the shape as input. We provide our generated meshes from the
frozen generator as input meshes to it. Since it needs mesh vertices to be dense to synthesize surface
details with vertex displacements, we further subdivide the input meshes with mid-point subdivision
to make sure each mesh has 50k-150k vertices on average.
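

As a rough illustration of the directional CLIP loss mentioned above, the sketch below follows the
formulation of StyleGAN-NADA [21]: one minus the cosine similarity between the image-space direction
(trainable render minus frozen render) and the text-space direction (target text minus source text).
The function name `directional_clip_loss` and the plain arrays standing in for CLIP embeddings are
assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def directional_clip_loss(img_train, img_frozen, txt_target, txt_source, eps=1e-8):
    """Mean over views of 1 - cos(delta_I, delta_T).

    img_train, img_frozen: (V, D) embeddings of V views rendered by the
        trainable and the frozen generator (hypothetical CLIP outputs).
    txt_target, txt_source: (D,) embeddings of the target and source prompts.
    """
    delta_i = img_train - img_frozen      # (V, D) image-space direction
    delta_t = txt_target - txt_source     # (D,)  text-space direction
    num = delta_i @ delta_t
    den = np.linalg.norm(delta_i, axis=1) * np.linalg.norm(delta_t) + eps
    return float(np.mean(1.0 - num / den))
```

With 16 rendered views per sample, `img_train` and `img_frozen` would have 16 rows; the loss is 0 when
every view moves exactly along the text direction and 2 when it moves in the opposite direction.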


[Figure 9 panels. Rows (text prompts): “burned car”, “polygon car”, “fluffy animal”, “tiger”, “haunted house”. Columns: Text2Mesh, Our results.]


Figure 9: Text-guided 3D synthesis. Note that Text2Mesh [44] requires 3D mesh geometry as input. 
To fulfil the requirement, we provide our generated geometry as its input mesh. 


5 Conclusion 


We introduced GET3D, a novel 3D generative model that is able to synthesize high-quality 3D 
textured meshes with arbitrary topology. GET3D is trained using only 2D images as supervision. 
We experimentally demonstrated significant improvements on generating 3D shapes over previous 
state-of-the-art methods on multiple categories. We hope that this work brings us one step closer to 
democratizing 3D content creation using A.I.


Limitations While GET3D makes a significant step towards a practically useful 3D generative 
model of 3D textured shapes, it still has some limitations. In particular, we still rely on 2D silhouettes 
as well as the knowledge of camera distribution during training. As a consequence, GET3D has
so far only been evaluated on synthetic data. A promising extension could use the advances in instance
segmentation and camera pose estimation to mitigate this issue and extend GET3D to real-world
data. GET3D is also trained per-category; extending it to multiple categories in the future could
help us better represent the inter-category diversity.


Broader Impact We proposed a novel 3D generative model that generates 3D textured meshes, 
which can be readily imported into current graphics engines. Our model is able to generate shapes with 
arbitrary topology, high quality textures and rich geometric details, paving the path for democratizing
A.I. tools for 3D content creation. Like all machine learning models, GET3D is also prone to biases
introduced in the training data. Therefore, an abundance of caution should be applied when dealing
with sensitive applications, such as generating 3D human bodies, as GET3D is not tailored for these
applications. We do not recommend using GET3D if privacy or erroneous recognition could lead to 
potential misuse or any other harmful applications. Instead, we do encourage practitioners to carefully 
inspect and de-bias the datasets before training our model to depict a fair and wide distribution of 
possible skin tones, races or gender identities. 
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This work was funded by NVIDIA. Jun Gao, Tianchang Shen, Zian Wang and Wenzheng Chen 
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Appendix 


In this Appendix, we first provide a detailed description of the GET3D network architecture (Sec. A.1-
A.4) along with the training procedure and hyperparameters (Sec. A.6). We then describe the datasets
(Sec. B.1), baselines (Sec. B.2), and evaluation metrics (Sec. B.3). Additional qualitative results,
ablation studies, robustness analysis, and results on the real dataset are available in Sec. C. Details
and additional results of the material generation for view-dependent lighting effects are provided in
Sec. D. Sec. E contains more information about the text-guided shape generation experiments as well
as additional qualitative results. The readers are also kindly referred to the accompanying video
(demo.mp4) that includes 360-degree renderings of our results (more than 400 generated shapes for 
each category), detailed zoom-ins, interpolations, material generation, and shapes generated with 
text-guidance. 


A Details of Our Model 


In Sec. 3 we have provided a high level description of GET3D. Here, we provide the implementation
details that were omitted due to the lack of space. Please consult Figure B and Figure 2 in the
main paper for more context. Source code is available at our project webpage.


A.1 Mapping Network 


Following StyleGAN [34, 35], our mapping networks $f_{\text{geo}}$ and $f_{\text{tex}}$ are 8-layer MLPs in which each
fully-connected layer has 512 hidden dimensions and a leaky-ReLU activation (Figure B). The
mapping networks are used to map the randomly sampled noise vectors $z_1 \in \mathbb{R}^{512}$ and $z_2 \in \mathbb{R}^{512}$ to
the latent vectors $w_1 \in \mathbb{R}^{512}$ and $w_2 \in \mathbb{R}^{512}$ as $w_1 = f_{\text{geo}}(z_1)$ and $w_2 = f_{\text{tex}}(z_2)$.
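

To make the shapes concrete, the following is a minimal NumPy sketch of two such mapping networks
with random, untrained weights. The leaky-ReLU slope of 0.2, the weight initialization, and the helper
names are assumptions for illustration and are not taken from the released code.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # The slope value is an assumption; the text does not state it.
    return np.where(x > 0, x, slope * x)

def init_mapping_network(rng, dim=512, num_layers=8):
    """Random (untrained) weights and zero biases for an 8-layer MLP."""
    return [(rng.normal(0.0, dim ** -0.5, size=(dim, dim)), np.zeros(dim))
            for _ in range(num_layers)]

def mapping_network(z, params):
    """Map noise z of shape (B, 512) to a latent w of shape (B, 512)."""
    x = z
    for weight, bias in params:
        x = leaky_relu(x @ weight + bias)
    return x

rng = np.random.default_rng(0)
f_geo = init_mapping_network(rng)        # geometry mapping network
f_tex = init_mapping_network(rng)        # texture mapping network
z1, z2 = rng.normal(size=(2, 4, 512))    # batch of 4 noise pairs
w1, w2 = mapping_network(z1, f_geo), mapping_network(z2, f_tex)
```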


A.2 Geometry Generator 


The geometry generator of GET3D starts from a randomly initialized feature volume $F_{\text{geo}} \in
\mathbb{R}^{4\times4\times4\times256}$ that is shared across the generated shapes, and is learned during training. Through a
series of four modulated 3D convolution blocks (ModBlock3D in Figure B), the initial volume is
up-sampled to a feature volume $F'_{\text{geo}} \in \mathbb{R}^{32\times32\times32\times64}$ that is conditioned on $w_1$. Specifically, in
each ModBlock3D, the input feature volume is first upsampled by a factor of two using trilinear
interpolation. It is then passed through a small 3D ResNet, where the residual path uses a 3D convolutional
layer with kernel size 1x1x1, and the main path applies two conditional 3D convolutional
layers with kernel size 3x3x3. To perform the conditioning, we follow StyleGAN2 [35] and first map
the latent vector $w_1$ to style $h$ through a learned affine transformation (A in Figure B). The style $h$ is
then used to modulate (M) and demodulate (D) the weights of the convolutional layers as:


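$$\text{M}: \quad \theta'_{i,j,k,l,m} = h_i \cdot \theta_{i,j,k,l,m}, \tag{4}$$

$$\text{D}: \quad \theta''_{i,j,k,l,m} = \theta'_{i,j,k,l,m} \Big/ \sqrt{\sum_{i,k,l,m} {\theta'}^{2}_{i,j,k,l,m}}, \tag{5}$$


where $\theta$ and $\theta'$ are the original and modulated weights, respectively. $h_i$ is the style corresponding to
the $i$th input channel, $j$ is the output channel dimension, and $k, l, m$ denote the spatial dimensions of
the 3D convolutional filter.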
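

The modulation and demodulation of Eq. 4 and Eq. 5 can be sketched in a few lines of NumPy. The
weight layout (input channel, output channel, three spatial axes) mirrors the index order used in the
text and is an assumption; deep learning frameworks typically store the output channel first, and
StyleGAN2 additionally adds a small epsilon under the square root, which is omitted here.

```python
import numpy as np

def modulate_demodulate(theta, h):
    """theta: (I, J, K, L, M) conv weights; h: (I,) style per input channel.

    Returns the modulated weights (Eq. 4) and the demodulated weights (Eq. 5).
    """
    theta_mod = h[:, None, None, None, None] * theta
    # Normalize over everything except the output channel axis j.
    norm = np.sqrt(np.sum(theta_mod ** 2, axis=(0, 2, 3, 4), keepdims=True))
    return theta_mod, theta_mod / norm
```

After demodulation, the weights feeding each output channel have unit L2 norm, which is the property
that replaces explicit feature normalization in StyleGAN2-style generators.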


Once we obtain the final feature volume $F'_{\text{geo}}$, the feature vector $f'_{\text{geo}} \in \mathbb{R}^{64}$ of each vertex $v$ in the
tetrahedral grid can be obtained through trilinear interpolation. We additionally feed the coordinates
of the point $p$ to a $[\sin(p), \cos(p)]$ positional encoding (PE) and concatenate the output with the
feature vector $f'_{\text{geo}}$. To decode the concatenated feature vector into the vertex offset $\Delta v \in \mathbb{R}^3$ or
the SDF value $s \in \mathbb{R}$, we pass it through three conditional FC layers (ModFC in Figure B). The
modulation and demodulation in these layers is done analogously to Eq. 5. All the layers, except for
the last, are followed by the leaky-ReLU activation function. In the last layer, we apply tanh to either
normalize the SDF prediction $s$ to be within $[-1, 1]$, or normalize the $\Delta v$ to be within $[-\frac{1}{\text{tet-res}}, \frac{1}{\text{tet-res}}]$,
where tet-res denotes the resolution of our tetrahedral grid, which we set to 90 in all the experiments.
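

A small sketch of the positional encoding and the two output activations follows. It assumes that the
offset range is $[-1/\text{tet-res}, 1/\text{tet-res}]$, i.e. that the tanh output is simply divided by the grid
resolution; the function names are hypothetical.

```python
import numpy as np

TET_RES = 90  # tetrahedral grid resolution stated in the text

def positional_encoding(p):
    """[sin(p), cos(p)] encoding of points p with shape (N, 3) -> (N, 6)."""
    return np.concatenate([np.sin(p), np.cos(p)], axis=-1)

def sdf_activation(raw):
    """Squash the raw network output to an SDF value in [-1, 1]."""
    return np.tanh(raw)

def offset_activation(raw, tet_res=TET_RES):
    """Squash the raw output so each offset component lies in [-1/tet_res, 1/tet_res]."""
    return np.tanh(raw) / tet_res
```

Bounding the offset by the inverse grid resolution keeps each vertex close to its original grid
position, so neighbouring vertices cannot move arbitrarily far past each other.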
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Figure B: Network architecture of GET3D. TI and PE denote trilinear interpolation and positional 
encoding, respectively. FC(a, b) represents a fully connected layer with a and b denoting the input
and output dimension, respectively. Similarly, Conv3D(a, b, c) denotes a 3D convolutional layer with 
a input channels, b output channels, and kernel dimension c x c x c. In the Texture Generator, the 
block ModBlock2D(512,512) is repeated four times. All convolutional layers have stride 1. 


Note that for simplicity, we remove all the noise vectors from StyleGAN [34, 35] and only have
stochasticity in the input $z$. Furthermore, following practices from DEFTET [22] and DMTET [60],
we use two copies of the geometry generator. One generates the vertex offsets $\Delta v$, while the other
outputs the SDF values $s$. The architecture of both is the same, except for the output dimension and
activation function of the last layer.


Volume Subdivision: In cases where modeling at a high resolution is required (e.g. motorbike with
thin structures in the wheels), we further use volume subdivision following DMTET [60]. As
illustrated in Fig. A, we first subdivide the tetrahedral grid and compute SDF values of the new
vertices (midpoints) by averaging the SDF values on the edge. Then we identify tetrahedra that have
vertices with different SDF signs. These are the tetrahedra that intersect with the underlying surface
encoded by the SDF. To refine the surface at increased grid resolution after subdivision, we further
predict the residual on SDF values and deformations to update $s$ and $\Delta v$ of the vertices in
identified tetrahedra. Specifically, we use an additional 3D convolutional layer to upsample feature
volume $F'_{\text{geo}}$ to $F''_{\text{geo}}$ of shape 64 x 64 x 64 x 8 conditioned on $w_1$. Then, following the steps described
above, we use trilinear interpolation to obtain per-vertex feature, concatenate it with PE and decode
the residuals $\delta s$ and $\delta v$ using conditional FC layers. The final SDF and vertex offset are computed as:


$$s' = s + \delta s, \qquad \Delta v' = \Delta v + \delta v. \tag{6}$$


Figure A: Volume subdivision. Each tetrahedron is divided into 8 smaller tetrahedra by connecting midpoints.
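

The three steps of volume subdivision (midpoint SDF averaging, selection of surface-crossing
tetrahedra, and the residual update of Eq. 6) can be sketched as below. The array layouts and function
names are assumptions for illustration, and a vertex with an SDF of exactly zero is treated as
non-negative.

```python
import numpy as np

def midpoint_sdf(sdf, edges):
    """SDF at each edge midpoint: mean of the SDF at the two edge endpoints.

    sdf: (V,) per-vertex SDF; edges: (E, 2) vertex indices.
    """
    return 0.5 * (sdf[edges[:, 0]] + sdf[edges[:, 1]])

def surface_tets(sdf, tets):
    """Boolean mask over tets (T, 4) whose vertices do not all share one SDF sign."""
    inside = sdf[tets] < 0                      # (T, 4)
    return inside.any(axis=1) & ~inside.all(axis=1)

def refine(s, dv, delta_s, delta_v):
    """Eq. 6: add the predicted residuals to the SDF values and vertex offsets."""
    return s + delta_s, dv + delta_v
```

Restricting the residual prediction to the tetrahedra selected by `surface_tets` keeps the extra cost
proportional to the surface area rather than to the full volume.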


A.3 Texture Generator 


We adapt the generator architecture from StyleGAN2 [35] to generate a tri-plane representation of
the texture field. Similar to the geometry generator, we start from a randomly initialized feature
grid $F_{\text{tex}} \in \mathbb{R}^{4\times4\times512}$ that is shared across the shapes, and is learned during training. This initial
feature grid is up-sampled to a feature grid $F'_{\text{tex}} \in \mathbb{R}^{256\times256\times96}$ that is conditioned on $w_1$ and $w_2$.
Specifically, we use a series of six modulated 2D convolution blocks (ModBlock2D in Figure B).
The ModBlock2D blocks are the same as the ModBlock3D blocks, except that the convolution is 2D
and that the conditioning is on $w_1 \oplus w_2$, where $\oplus$ denotes concatenation. Additionally, the output
of each ModBlock2D block is passed through a conditional tTPF layer that applies a conditional
2D convolution with kernel size 1x1. Note that, following the practices from StyleGAN2 [35],
the conditioning in the tTPF layers is performed only through modulation of the weights (no
demodulation).
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Figure C: Improved generator architecture of GET3D. High-level overview left and detailed 
architecture right. Different to the model architecture proposed in the main paper, the new generator 
shares the same backbone network for both geometry and texture generation. This improves the 
information flow and enables better disentanglement of the geometry and texture. 


The output of the last tTPF layer is then reshaped into three axis-aligned feature planes of size
256 x 256 x 32.


To obtain the feature $f_{\text{tex}} \in \mathbb{R}^{32}$ of a surface point $p \in \mathbb{R}^3$, we first project $p$ onto each plane, perform
bilinear interpolation of the features, and finally sum the interpolated features:


$$f_{\text{tex}} = \sum_{e} \rho(\pi_e(p)), \tag{7}$$


where $\pi_e(p)$ is the projection of the point $p$ to the feature plane $e$ and $\rho(\cdot)$ denotes bilinear interpolation
of the features. Color $c \in \mathbb{R}^3$ of the point $p$ is then decoded from $f_{\text{tex}}$ using three conditional
FC layers (ModFC) conditioned on $w_1 \oplus w_2$. The hidden dimension of each layer is 16. Following
StyleGAN2 [35], we do not apply normalization to the final output.
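

The tri-plane lookup of Eq. 7 can be illustrated with the NumPy sketch below. It assumes points
normalized to $[-1, 1]^3$, planes stored as (height, width, channels), and the xy, xz, yz pairing of axes
for the three planes; none of these conventions is specified in the text, and the helper names are
hypothetical.

```python
import numpy as np

def bilinear(plane, uv):
    """Bilinearly interpolate plane (H, W, C) at uv in [-1, 1]^2, uv shape (N, 2)."""
    h, w, _ = plane.shape
    x = (uv[:, 0] + 1) * 0.5 * (w - 1)
    y = (uv[:, 1] + 1) * 0.5 * (h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    tx, ty = (x - x0)[:, None], (y - y0)[:, None]
    top = (1 - tx) * plane[y0, x0] + tx * plane[y0, x0 + 1]
    bot = (1 - tx) * plane[y0 + 1, x0] + tx * plane[y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bot

def query_triplane(planes, p):
    """Eq. 7: sum of bilinear samples from three axis-aligned feature planes."""
    axes = [(0, 1), (0, 2), (1, 2)]   # assumed axis pairing per plane
    return sum(bilinear(pl, p[:, list(ax)]) for pl, ax in zip(planes, axes))
```

For three planes of size 256 x 256 x 32, `query_triplane` returns one 32-dimensional feature per query
point, which is the quantity that the ModFC layers decode into a color.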


A.4 2D Discriminator 


We use two discriminators to train GET3D: one for the RGB output and one for the 2D silhouettes. 
For both, we use exactly the same architecture as the discriminator in StyleGAN [34]. Empirically, 
we have observed that conditioning the discriminator on the camera pose leads to canonicalization of 
the shape orientations. However, discarding this conditioning only slightly affects the performance, as 
shown in Section C.3. In fact, we primarily use this conditioning to enable the evaluation of geometry 
using evaluation metrics, which assume that the shapes are generated in the canonical frame. 


A.5 Improved Generator 


The motivation for sampling two noise vectors $(z_1, z_2)$ in the generator is to enable disentanglement
of the geometry and texture, where geometry is to be treated as a first-class citizen. Indeed, the
geometry should only be controlled by the geometry latent code, while the texture should be able to
not only adapt to the changes in the texture latent code, but also to the changes in geometry, i.e. a
change in the geometry latent should propagate to the texture. However, in the original design of
the GET3D generator (cf. Sec. 3 and Fig. 2) the information flow from the geometry to the texture
generator is very limited: it consists only of the concatenation of the two latent codes (Fig. B). Such a weak connection
makes it hard to learn the disentanglement of geometry and texture and the texture generator can even
learn to ignore the texture latent code (Fig. D).


This empirical observation motivated us to improve the design of the generator network, after the
initial submission, by improving the information flow, which in turn better supports the disentanglement
of the geometry and texture. To this end, our improved generator shares the same backbone
network for both geometry and texture generation, as shown in Fig. C. In particular, we follow
SemanticGAN [38] and use the StyleGAN2 [35] backbone. Each ModBlock2D (modulated with the
geometry latent code $w_1$) now has two tTPF branches, one for generating the geometry features
(tGEO), and the other for generating the texture features (tTEX). The outputs of this backbone network are
two feature triplanes, one for geometry and one for texture. To predict the SDF value and deformation
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Figure D: Disentanglement of geometry and texture achieved by the original model depicted in 
Fig. 2. In each row, we show shapes generated from the same texture latent code, while changing the 
geometry latent code. In each column, we show shapes generated from the same geometry latent 
code, while changing the texture code. The original model fails to achieve good disentanglement. 


for each vertex in the tetrahedral grid, we project the vertex onto each of the geometry triplanes,
obtain its feature vector using Eq. 7, and finally use a ModFC to decode $s_i$ and $\Delta v_i$. The prediction
of the color in the texture branch remains unchanged.


Qualitative results of the geometry and texture disentanglement achieved with this improved generator
are depicted in Fig. E and F. The shared backbone network allows us to achieve much better disentanglement
of geometry and texture (Fig. D vs Fig. E), while also achieving better quantitative metrics on
the task of unconditional generation (Tab. 2).


A.6 Training Procedure and Hyperparameters 


We implement GET3D on top of the official PyTorch implementation of StyleGAN2 [35]. Our
training configuration largely follows StyleGAN2 [35] including: using a minibatch standard deviation
in the discriminator, exponential moving average for the generator, non-saturating logistic loss,
and R1 regularization. We train GET3D along with the 2D discriminators from scratch, without
progressive training or initialization from pretrained checkpoints. Most of our hyper-parameters are
adopted from StyleGAN2 [35]. Specifically, we use the Adam optimizer with learning rate 0.002 and
$\beta = 0.9$. For R1 regularization, we set the regularization weight $\gamma$ to 3200 for chair, 80 for car, 40
for animal, 80 for motorbike, 80 for renderpeople, and 200 for house. We follow StyleGAN2 [35]
and use lazy regularization, which applies R1 regularization to discriminators only every 16 training
steps. Finally, we set the hyperparameter $\mu$ that controls the SDF regularization to 0.01 in all the
experiments. We train our model using a batch size of 32 on 8 A100 GPUs for all the experiments.
Training a single model takes about 2 days to converge.
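

For reference, the hyperparameters listed above can be collected in a small configuration sketch. The
dictionary keys and the helper `r1_weight` are hypothetical names, not those of the released code, and
the rescaling that StyleGAN2 implementations apply to compensate for lazy regularization is
intentionally left out.

```python
# R1 regularization weight per category, as listed in the text.
R1_GAMMA = {"chair": 3200, "car": 80, "animal": 40,
            "motorbike": 80, "renderpeople": 80, "house": 200}

TRAIN_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 0.002,
    "beta": 0.9,              # the single Adam beta value quoted in the text
    "batch_size": 32,
    "num_gpus": 8,
    "r1_interval": 16,        # lazy regularization period (training steps)
    "sdf_reg_weight": 0.01,   # weight of the SDF regularization
}

def r1_weight(category, step, cfg=TRAIN_CONFIG):
    """R1 weight applied to the discriminators at a given step (0 if skipped)."""
    if step % cfg["r1_interval"] != 0:
        return 0.0
    return float(R1_GAMMA[category])
```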


B Experimental Details


B.1 Datasets 


We evaluate GET3D on ShapeNet [9], TurboSquid [4], and RenderPeople [2] datasets. In the 
following, we provide their detailed description and the preprocessing steps that were used in our 
evaluation. Detailed statistics of the datasets are available in Table A.


²StyleGAN3: https://github.com/NVlabs/stylegan3 (NVIDIA Source Code License)
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Figure E: Disentanglement of geometry and texture achieved by the improved model depicted in 
Fig. C. In each row, we show shapes generated from the same texture latent code, while changing the 
geometry latent code. In each column, we show shapes generated from the same geometry latent 
code, while changing the texture code. Compared with
Fig. D, this improved model achieves significantly better disentanglement of geometry and texture.


Figure F: Shape Interpolation. We interpolate the latent code from top-left corner to the bottom-right 
corner. In each row, we keep the texture latent code fixed and interpolate the geometry latent code. In 
each column, we keep the geometry latent code fixed and interpolate the texture latent code. GET3D 
adequately disentangles geometry and texture, while also providing a meaningful interpolation for
both geometry and texture.


ShapeNet³ [9] contains more than 51k shapes from 55 different categories and is the most commonly
used dataset for benchmarking 3D generative models⁴. Prior work [68, 75] typically uses the
categories Airplane, Car, and Chair for evaluation. Herein, we replace the category Airplane with
Motorcycle, which has more complex geometry and contains shapes with varying genus. Car, Chair,
and Motorcycle contain 7497, 6778, and 337 shapes, respectively. We randomly split the shapes of each
category into training (70%), validation (10%), and test (20%) and remove from the test set shapes
that have duplicates in the training set.
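

A minimal sketch of such a split is given below. The rounding rule, the fixed seed, and the
`fingerprint` callback used to detect duplicates are assumptions for illustration; the text does not
state how duplicates were identified.

```python
import random

def split_shapes(shape_ids, seed=0, train=0.7, val=0.1):
    """Randomly split ids into train/val/test (70%/10%/20% by default)."""
    ids = list(shape_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(train * len(ids)))
    n_val = int(round(val * len(ids)))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def drop_train_duplicates(test_ids, train_ids, fingerprint):
    """Remove test shapes whose fingerprint also occurs in the training set.

    fingerprint: hypothetical callable mapping a shape id to a hashable
    signature of the shape (e.g. a hash of its geometry).
    """
    seen = {fingerprint(i) for i in train_ids}
    return [i for i in test_ids if fingerprint(i) not in seen]
```

For the 337 Motorcycle shapes, this rounding yields 236 training, 34 validation, and 67 test shapes
before duplicate removal.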


TurboSquid⁵ [4] is a large collection of various 3D shapes with high-quality geometry and texture,
and is thus well suited to evaluate the capacity of GET3D to generate shapes with high-quality


³The ShapeNet license is explained at https://shapenet.org/terms

⁴Herein, we used the ShapeNet v1 Core subset obtained from https://shapenet.org/

⁵https://www.turbosquid.com; we obtained consent via an agreement with TurboSquid and follow the
license at https://blog.turbosquid.com/turbosquid-3d-model-license/


Dataset              # Shapes   # Views per shape   Rotation Angle   Elevation Angle

ShapeNet Car         7497       24                  [0, 2π]          [1/3 π, 1/2 π]
ShapeNet Chair       6778       24                  [0, 2π]          [1/3 π, 1/2 π]
ShapeNet Motorbike   337        100                 [0, 2π]          [1/3 π, 1/2 π]
Turbosquid Animal    442        100                 [0, 2π]          [1/4 π, 1/2 π]
Turbosquid House     563        100                 [0, 2π]          [1/3 π, 1/2 π]
Renderpeople         500        100                 [0, 2π]          [1/3 π, 1/2 π]


Table A: Dataset statistics.


details. To this end, we use the category Animal that contains 442 textured shapes with high diversity 
ranging from cats, dogs, and lions, to bears and deer [60, 70]. We again randomly split the shapes 
into training (70%), validation (10%), and test (20%) set. Additionally, we provide qualitative results 
on the category House that contains 563 shapes. Since we perform only qualitative evaluation on 
House, we use all the shapes for training. 


RenderPeople⁶ [2] is a large dataset containing photorealistic 3D models of real-world humans. We
use it to showcase the capacity of GET3D to generate high-quality and diverse characters that can 
be used to populate virtual environments, such as games or even movies. In particular, we use 500 
models from the whole dataset for training and only perform qualitative analysis. 


Preprocessing To generate the data, we first scale each shape such that the longest edge of its
bounding-box equals $e_m$, where $e_m = 0.9$ for Car, Motorcycle, and Human, $e_m = 0.8$ for House,
and $e_m = 0.7$ for Chair and Animal. For methods that use 2D supervision (Pi-GAN, GRAF, EG3D,
and our model GET3D), we then render the RGB images and silhouettes from camera poses sampled
from the upper hemisphere of each object. Specifically, we sample 24 camera poses for Car and
Chair, and 100 poses for Motorcycle, Animal, House, and Human. The rotation and elevation angles
of the camera poses are sampled uniformly from a specified range (see Table A). For all camera
poses, we use a fixed radius of 1.2 and a fov angle of 49.13°. We render the images in Blender [15]
using fixed lighting, unless specified differently.
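

The scaling and camera sampling steps can be sketched as follows. The sketch assumes a y-up
coordinate system, interprets the elevation value as the polar angle measured from the up axis, and
also centers the shape at the origin; these conventions and the function names are assumptions and may
differ from the actual rendering scripts.

```python
import numpy as np

def scale_to_longest_edge(vertices, e_m):
    """Center vertices (N, 3) and scale them so the longest bbox edge equals e_m."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    return (vertices - 0.5 * (lo + hi)) * (e_m / (hi - lo).max())

def sample_camera_positions(n, elev_range, radius=1.2, rng=None):
    """Sample n camera centers with angles drawn uniformly from their ranges."""
    rng = np.random.default_rng() if rng is None else rng
    rot = rng.uniform(0.0, 2.0 * np.pi, size=n)
    elev = rng.uniform(elev_range[0], elev_range[1], size=n)
    # Assumed convention: elev is the polar angle from the +y (up) axis.
    x = radius * np.sin(elev) * np.cos(rot)
    y = radius * np.cos(elev)
    z = radius * np.sin(elev) * np.sin(rot)
    return np.stack([x, y, z], axis=1)
```

Under this convention, an elevation range of `(np.pi / 3, np.pi / 2)` places every camera on the upper
hemisphere at distance 1.2 from the origin.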


For the methods that rely on 3D supervision, we follow their preprocessing pipelines [68, 43]. 
Specifically, for PointFlow [68] we randomly sample 15k points from the surface of each shape,
while for OccNet [43] we convert the shapes into watertight meshes by rendering depth frames from 
random camera poses and performing TSDF fusion. 


B.2 Baselines 


PointFlow [68] is a 3D point cloud generative model based on continuous normalizing flows. It
models the generative process by learning a distribution of distributions, where the former denotes
the distribution of shapes and the latter the distribution of points given a shape [68]. PointFlow
generates only the geometry, which is represented in the form of a point cloud. To generate the results
of [68], we use the original source code provided by the authors⁷ and train the models on our data.
To compute the metrics based on LFD, we convert the output point clouds (10k points) to a mesh
representation using the Open3D [77] implementation of Poisson surface reconstruction [36].


OccNet [43] is an implicit method for 3D surface reconstruction, which can also be applied to 
unconditional generation of 3D shapes. OccNet is an autoencoder that learns a continuous mapping 
from 3D coordinates to occupancy values, from which an explicit mesh can be extracted using 
marching cubes [39]. When applied to unconditional 3D shape generation, OccNet is trained as a 
variational autoencoder. To generate the results of [43], we use the original source code provided by
the authors⁸ and train the models on our data.


⁶We follow the license of Renderpeople: https://renderpeople.com/general-terms-and-conditions/

⁷PointFlow: https://github.com/stevenygd/PointFlow (MIT License)

⁸OccNet: https://github.com/autonomousvision/occupancy_networks (MIT License)
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GRAF [57] is a generative model that tackles the problem of 3D-aware image synthesis. GRAF’s 
underlying representation is a neural radiance field—conditioned on the shape and appearance latent 
codes—parameterized using a multi-layer perceptron with positional encoding. To synthesize novel 
views, GRAF utilizes a neural volume rendering approach similar to NeRF [45]. In our evaluation, we
use the source code provided by the authors⁹ and train GRAF models on our data.


Pi-GAN [7], similar to GRAF, also tackles the problem of 3D-aware image synthesis, but
uses a Siren [61] network (conditioned on a randomly sampled noise vector) to parameterize the
neural radiance field. To generate the results of Pi-GAN [7], we use the original source code provided
by the authors¹⁰ and train the models on our data.


EG3D [8] is a recent model for 3D-aware image synthesis. Similar to our method, EG3D builds upon 
the StyleGAN formulation and uses a tri-plane representation to parameterize the underlying neural 
radiance field. To improve the efficiency and to enable synthesis at higher resolution, EG3D utilizes 
neural rendering at a lower resolution and then upsamples the output using a 2D CNN. The source 
code of EG3D was provided to us by the authors. To generate the results, we train and evaluate EG3D 
on our data. 


B.3 Evaluation Metrics 


To evaluate the performance, we compare both the texture and geometry of the generated shapes $S_g$
to the reference ones $S_r$.


B.3.1 Evaluating the Geometry 


To evaluate the geometry, we use all shapes of the test set as $S_r$, and synthesize five times as many
generated shapes, such that $|S_g| = 5|S_r|$, where $|\cdot|$ denotes the cardinality of a set. Following prior
work [68, 14], we use Chamfer Distance $d_{\text{CD}}$ and Light Field Distance $d_{\text{LFD}}$ [13] to measure the
similarity of the shapes, which is in turn used to compute Coverage (COV) and Minimum Matching
Distance (MMD) evaluation metrics.


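Let $X \in S_g$ denote a generated shape and $Y \in S_r$ a reference one. To compute $d_{\text{CD}}$, we first
randomly sample $N = 2048$ points $X_p \in \mathbb{R}^{N\times3}$ and $Y_p \in \mathbb{R}^{N\times3}$ from the surface of the shapes $X$
and $Y$, respectively¹¹. The $d_{\text{CD}}$ can then be computed as:


$$d_{\text{CD}}(X_p, Y_p) = \sum_{x \in X_p} \min_{y \in Y_p} \|x - y\|_2^2 + \sum_{y \in Y_p} \min_{x \in X_p} \|x - y\|_2^2. \tag{8}$$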
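

A direct NumPy sketch of the symmetric Chamfer distance is given below. It assumes squared Euclidean
distances and sums (rather than means) over the points, and it builds the full pairwise distance
matrix, which is affordable for N = 2048 but would need a nearest-neighbour structure for much larger
point sets.

```python
import numpy as np

def chamfer_distance(xp, yp):
    """Symmetric Chamfer distance between point sets xp (N, 3) and yp (M, 3)."""
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((xp[:, None, :] - yp[None, :, :]) ** 2, axis=-1)
    # Nearest neighbour in yp for every x, plus nearest neighbour in xp for every y.
    return float(d2.min(axis=1).sum() + d2.min(axis=0).sum())
```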


While Chamfer distance has been widely used in the field of 3D generative models and reconstruction
[11, 22, 60], LFD has received a lot of attention in computer graphics [13]. Inspired by human
perception, LFD measures the similarity between the 3D shapes based on their appearance from
different viewpoints. In particular, LFD renders the shapes $X$ and $Y$ (represented as explicit meshes)
from a set of selected viewpoints, encodes the rendered images using Zernike moments and Fourier
descriptors, and computes the similarity over these encodings. A formal definition of LFD is available
in [13]. In our evaluation, we use the official implementation to compute $d_{\text{LFD}}$¹².


We combine these similarity measures with the evaluation metrics proposed in [5], which are 
commonly used to evaluate 3D generative models: 
• Coverage (COV) measures the fraction of shapes in the reference set that are matched to at
least one of the shapes in the generated set. Formally, COV is defined as

$$\text{COV}(S_g, S_r) = \frac{\left|\{\arg\min_{Y \in S_r} D(X, Y) \mid X \in S_g\}\right|}{|S_r|}, \tag{9}$$

where the distance metric $D$ can be either $d_{\text{CD}}$ or $d_{\text{LFD}}$. Intuitively, COV measures the
diversity of the generated shapes and is able to detect mode collapse. However, COV does


⁹GRAF: https://github.com/autonomousvision/graf (MIT License)

¹⁰Pi-GAN: https://github.com/marcoamonteiro/pi-GAN (License not provided)

¹¹For PointFlow [68], we directly use N points generated by the model.

¹²LFD: https://github.com/Sunwinds/ShapeDescriptor/tree/master/LightField/3DRetrieval_v1.8/3DRetrieval_v1.8 (License not provided)
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not measure the quality of individual generated shapes. In fact, it is possible to achieve high 
COV even when the generated shapes are of very low quality. 


• Minimum Matching Distance (MMD) complements the COV metric by measuring the quality
of the individual generated shapes. Formally, MMD is defined as

$$\text{MMD}(S_g, S_r) = \frac{1}{|S_r|} \sum_{Y \in S_r} \min_{X \in S_g} D(X, Y), \tag{10}$$

where $D$ can again be either $d_{\text{CD}}$ or $d_{\text{LFD}}$. Intuitively, MMD measures the quality of the
generated shapes by comparing their geometry to the closest reference shape.
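

Given a precomputed matrix of pairwise distances between generated and reference shapes, both metrics
reduce to a few array operations. The sketch below assumes that rows index generated shapes and
columns index reference shapes; the function names are hypothetical.

```python
import numpy as np

def coverage(dist):
    """COV: fraction of reference shapes that are the nearest neighbour of
    at least one generated shape. dist[i, j] = D(X_i, Y_j)."""
    matched = np.unique(dist.argmin(axis=1))   # nearest reference per generated shape
    return matched.size / dist.shape[1]

def minimum_matching_distance(dist):
    """MMD: mean over reference shapes of the distance to the closest generated shape."""
    return float(dist.min(axis=0).mean())
```

Higher COV indicates better diversity, while lower MMD indicates better fidelity, which is why the two
metrics are reported together.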


B.3.2 Evaluating the Texture and Geometry


To evaluate the quality of the generated textures, we adopt the Fréchet Inception Distance (FID) 
metric, commonly used to evaluate the synthesis quality of 2D images. In particular, for each category, 
we render 50k views of the generated shapes (one view per shape) from the camera poses randomly 
sampled from the predefined camera distribution, and use all the images in the test set. We then 
encode these images using a pretrained Inception v3 [63] model¹³, where we consider the output of
the last pooling layer as our final encoding. The FID metric can then be computed as: 


$$\text{FID}(S_g, S_r) = \|\mu_g - \mu_r\|_2^2 + \text{Tr}\left[\Sigma_g + \Sigma_r - 2(\Sigma_g \Sigma_r)^{1/2}\right], \tag{11}$$


where Tr denotes the trace operation. $\mu_g$ and $\Sigma_g$ are the mean value and covariance matrix of the
generated image encoding, while $\mu_r$ and $\Sigma_r$ are obtained from the encoding of the test images.
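

The FID formula can be evaluated from two sets of feature vectors as sketched below. To stay
dependency-free, the sketch computes the trace of the matrix square root from the eigenvalues of the
covariance product instead of calling a dedicated matrix square root routine; this choice and the
function name are assumptions, and the Inception encoding itself is not included.

```python
import numpy as np

def fid_from_features(feat_g, feat_r):
    """FID between feature sets of shape (N, D) (generated) and (M, D) (reference)."""
    mu_g, mu_r = feat_g.mean(axis=0), feat_r.mean(axis=0)
    sigma_g = np.cov(feat_g, rowvar=False)
    sigma_r = np.cov(feat_r, rowvar=False)
    # Tr((Sigma_g Sigma_r)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in exact
    # arithmetic; small negative or imaginary parts are numerical noise.
    eigvals = np.linalg.eigvals(sigma_g @ sigma_r)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_g - mu_r) ** 2)
                 + np.trace(sigma_g) + np.trace(sigma_r) - 2.0 * tr_sqrt)
```

Two identical feature sets give an FID of zero, and shifting every feature by a constant vector
increases the FID by the squared norm of that shift.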


As briefly discussed in the main paper, we use two variants of FID, which differ in the way in which 
the 2D images are rendered. In particular, for FID-Ori, we directly use the neural volume rendering 
of the 3D-aware image synthesis methods to obtain the 2D images. This metric favours the baselines 
that were designed to directly generate valid 2D images through neural rendering. Additionally, we 
propose a new metric, FID-3D, which puts more emphasis on the overall quality of the generated 3D 
shape. Specifically, for the baselines which do not output a textured mesh, we extract the geometry 
from their underlying neural field using marching cubes [39]. Then, we find the intersection point of 
each pixel ray with the generated mesh and use the 3D location of the intersected point to query the 
RGB value from the network. In this way, the rendered image is a more faithful representation of 
the underlying 3D shape and takes the quality of both geometry and texture into account. Note that 
FID-3D and FID-Ori are identical for methods that directly generate textured 3D meshes, as is the
case with GET3D.


C Additional Results on the Unconditioned Shape Generation 


In this section we provide additional results on the task of unconditional 3D shape generation. First, 
we perform additional qualitative comparison of GET3D with the baselines in Section C.1. Second, 
we present further qualitative results of GET3D in Section C.2. Third, we provide additional ablation 
studies in Section C.3. We also analyse the robustness and effectiveness of GET3D. Specifically, 
in Sec. C.4 and C.5, we evaluate GET3D trained with noisy cameras and 2D silhouettes predicted 
by 2D segmentation networks. We further provide additional experiments on the StyleGAN-generated
realistic dataset from GANverse3D [73] in Sec. C.6. Finally, we provide additional comparison with
EG3D [8] on human character generation in Sec. C.7. 


C.1 Additional Qualitative Comparison with the Baselines 


Comparing the Geometry of Generated Shapes We provide additional visualization of the 3D 
shapes generated by GET3D and compare them to the baseline methods in Figure Q. GET3D is able
to generate shapes with complex geometry, different topology, and varying genus. When compared 
to the baselines, the shapes generated by GET3D contain more details and are more diverse. 


¹³Inception network checkpoint path: http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz
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Figure G: Shape retrieval of our generated shapes. We retrieve the closest shape in the training set for each 
of the shapes shown in Figure 1. Our generator is able to generate novel shapes that are different from the
training set.
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Figure H: Training loss curve for discriminator. We compare the training dynamics of using a 
single discriminator on both the RGB image and the 2D silhouette with those of using two discriminators,
one for each image type. The horizontal axis represents the number of images that the discriminators
have seen during training (mod by 1000). Two discriminators greatly reduce training instability and
help us obtain good results.


Comparing the Synthesized Images We provide additional results on the task of 2D image 
generation in Figure R. Even though GET3D is not designed for this task, it produces comparable 
results to the strong baseline EG3D [8], while significantly outperforming other baselines, such as 
Pi-GAN [7] and GRAF [57]. Note that GET3D directly outputs 3D textured meshes, which are
compatible with standard graphics engines, while extracting such representation from the baselines is 
non-trivial. 


C.2 Additional Qualitative Results of GET3D 


We provide additional visualizations of the generated geometry and texture in Figures S-X. GET3D 
can generate high quality shapes with diverse textures across all the categories, from chairs, cars, 
and animals, to motorbikes, humans, and houses. The accompanying video (demo.mp4) contains further
visualizations, including detailed 360° turntable animations for 400+ shapes and interpolation results. 


Closest Shape Retrieval To demonstrate that GET3D is capable of generating novel shapes, we 


perform shape retrieval for our generated shapes. In particular, we retrieve the closest shape in the 
training set for each of the shapes shown in Figure 1 by measuring the CD between the generated
[Figure I plot legend: Two Discriminators, Single Discriminator]


Figure I: Training loss curve for generator. We compare the training dynamics of using a single
discriminator on both the RGB image and the 2D silhouette with those of using two discriminators,
one for each image type. The horizontal axis represents the number of images the discriminators have seen during training
(mod by 1000).


Model                         FID


GET3D w/o Camera Condition    11.63
GET3D w/ Camera Condition     10.25


Table B: Ablation on the camera condition. We ablate conditioning the discriminator on the camera pose. We train
the model on the ShapeNet Car dataset.


shape and all training shapes. Results are provided in Figure G. All generated shapes in Figure 1
differ significantly from their closest shape in the training set, exhibiting different geometry and
texture while maintaining quality and diversity.
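The retrieval step described above can be illustrated with a short sketch. It assumes each shape is represented by a point cloud sampled from its surface and uses one common variant of the Chamfer distance (mean squared nearest-neighbour distance in both directions); the exact CD variant and sampling density used for Figure G are not specified here, and the function names and toy data are hypothetical.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean squared nearest-neighbour distance in both directions.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def retrieve_closest(query: np.ndarray, training_shapes: list) -> tuple:
    """Return (index, distance) of the training shape closest to `query`."""
    dists = [chamfer_distance(query, t) for t in training_shapes]
    best = int(np.argmin(dists))
    return best, dists[best]

# Toy data (assumption): three "training shapes" and a slightly perturbed query.
rng = np.random.default_rng(0)
train = [rng.normal(size=(256, 3)) + offset for offset in (0.0, 2.0, 5.0)]
query = train[1] + rng.normal(scale=0.01, size=(256, 3))
idx, dist = retrieve_closest(query, train)
```

A generated shape is considered novel when even its nearest training shape, found this way, differs visibly in geometry and texture.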


Volume Subdivision We provide further qualitative results highlighting the benefits of volume
subdivision in Figure Y. Specifically, we compare the shapes generated with and without volume
subdivision on the ShapeNet motorbike category. Volume subdivision enables GET3D to generate finer
geometric details, such as handles and steel wires, which are otherwise hard to represent.


C.3 Additional Ablation Studies


We now provide additional ablation studies in an attempt to further justify our design choices. 
In particular, we first discuss the design choice of using two dedicated discriminators for RGB 
images and 2D silhouettes, before ablating the impact of adding the camera pose conditioning to the 
discriminator. 


C.3.1 Using Two Dedicated Discriminators 


We empirically find that using a single discriminator on both RGB images and silhouettes introduces
significant training instability, which leads to divergence when training GET3D. We provide a
comparison of the training dynamics in Figures H and I, where we depict the loss curves for the
generator and discriminator. We hypothesize that the instability might be caused by the fact that 
a single discriminator has access to both geometry (from 2D silhouettes) and texture (from RGB 
image) of the shape, when classifying whether the image is real or not. Since we randomly initialize 
our geometry generator, the discriminator can quickly overfit to one aspect—either geometry or 
texture—and thus produces bad gradients for the other branch. A two-stage approach in which two 
discriminators would be used in the first stage of the training, and a single discriminator in the later 
stage, when the model has already learned to produce meaningful shapes, is an interesting research 
direction, which we plan to explore in the future. 
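To make the two-discriminator setup concrete, the following is a minimal sketch of how the adversarial terms could be combined. It assumes a non-saturating logistic GAN loss and omits any gradient regularization; the logits are random stand-ins for the outputs of two hypothetical networks, `D_rgb` and `D_mask`, and nothing here is taken from the released training code.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def discriminator_loss(real_logits, fake_logits):
    """Non-saturating logistic loss for one discriminator (assumed form)."""
    return float(np.mean(softplus(-real_logits)) + np.mean(softplus(fake_logits)))

def generator_loss(fake_logits_rgb, fake_logits_mask):
    """Generator loss: sum of the adversarial terms from both discriminators."""
    return float(np.mean(softplus(-fake_logits_rgb)) + np.mean(softplus(-fake_logits_mask)))

# Hypothetical logits standing in for the outputs of D_rgb and D_mask.
rng = np.random.default_rng(0)
real_rgb, fake_rgb = rng.normal(1.0, 1.0, 8), rng.normal(-1.0, 1.0, 8)
real_mask, fake_mask = rng.normal(1.0, 1.0, 8), rng.normal(-1.0, 1.0, 8)

loss_d_rgb = discriminator_loss(real_rgb, fake_rgb)     # would update D_rgb only
loss_d_mask = discriminator_loss(real_mask, fake_mask)  # would update D_mask only
loss_g = generator_loss(fake_rgb, fake_mask)            # would update the generator
```

Because each discriminator sees only one modality, the texture and geometry branches each receive a gradient signal that the other modality cannot dominate.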


C.3.2 Ablation on Using Camera Condition for Discriminator 


Since we are mainly operating on synthetic datasets in which the shapes are aligned to a canonical 
direction, we condition the discriminators on the camera pose of each image. In this way, GET3D 




Figure J: Additional qualitative results of GET3D trained with noisy cameras. We render generated shapes 
in Blender. The visual quality is similar to original GET3D in the main paper. 
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Figure K: Additional qualitative results of GET3D trained with predicted 2D silhouettes (Mask-Black). 
We render generated shapes in Blender. The visual quality is similar to original GET3D in the main paper. 
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Figure L: Additional qualitative results of GET3D trained with predicted 2D silhouettes (Mask-Random). 
We render generated shapes in Blender. The visual quality is similar to original GET3D in the main paper. 


learns to generate shapes in the canonical orientation, which simplifies the evaluation when using 
metrics that assume that the input shapes are canonicalized. We now ablate this design choice. 
Specifically, we train another model without the conditioning and evaluate its performance in terms 
of FID score. Quantitative results are given in Table B. We observe that removing the camera pose
conditioning only slightly degrades the performance of GET3D (the FID increases by 1.38, from 10.25 to 11.63). This confirms that our
model can be successfully trained without such conditioning, and that the primary benefit of using it 
is the easier evaluation. 


Method                                            FID


GET3D - original                                  10.25
GET3D - noisy cameras                             19.53
GET3D - predicted 2D silhouettes (Mask-Black)     29.68
GET3D - predicted 2D silhouettes (Mask-Random)    33.16


Table C: Additional quantitative results for noisy cameras and using predicted 2D silhouettes on
the ShapeNet Car dataset.


C.4 Robustness to Noisy Cameras 


To demonstrate the robustness of GET3D to imperfect camera poses, we add Gaussian noise to
the camera poses during training. Specifically, for the rotation angle, we add noise sampled from
a Gaussian distribution with zero mean and a variance of 10 degrees. For the elevation angle, we also
add noise sampled from a Gaussian distribution with zero mean and a variance of 2 degrees. We use
the ShapeNet Car dataset [9] in this experiment.
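The perturbation can be sketched as follows. This is an illustrative sketch only: the function name and parameters are hypothetical, and it takes the quoted noise levels literally as variances (so the standard deviation is their square root). If the values were meant as standard deviations, their squares should be passed instead.

```python
import numpy as np

def perturb_camera(rotation_deg, elevation_deg, rng,
                   rotation_var=10.0, elevation_var=2.0):
    """Add zero-mean Gaussian noise to a camera's rotation and elevation angles.

    The noise levels are interpreted as variances (assumption); the standard
    deviation passed to the sampler is therefore their square root.
    """
    noisy_rotation = rotation_deg + rng.normal(0.0, np.sqrt(rotation_var))
    noisy_elevation = elevation_deg + rng.normal(0.0, np.sqrt(elevation_var))
    # Wrap the rotation angle back into [0, 360).
    return noisy_rotation % 360.0, noisy_elevation

rng = np.random.default_rng(0)
samples = np.array([perturb_camera(90.0, 30.0, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.var(axis=0))
```

The empirical mean stays close to the clean pose while the spread matches the requested variances, which is the kind of moderate pose noise studied in this experiment.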


The quantitative results are provided in Table C and qualitative examples are depicted in Figure J.
Adding camera noise harms the FID metric (19.53 vs. 10.25), whereas we observe only minor degradation in visual
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Figure M: Additional qualitative results of GET3D trained with "real" GANverse3D [73] data. We render 
generated shapes in Blender. 


Figure N: We show randomly sampled 2D images and silhouettes from GANverse3D [73] data. Note the 
realism of the images and the imperfections of the 2D silhouettes. 


quality. We hypothesize that the drop in the FID is a consequence of the camera pose distribution
mismatch, which occurs as a result of rendering the testing dataset, used to calculate the FID score,
with a camera pose distribution without added noise. Nevertheless, based on the visual quality of
the generated shapes, we conclude that GET3D is robust to a moderate level of noise in the camera
poses.


C.5 Robustness to Imperfect 2D Silhouettes 


To evaluate the robustness of GET3D when trained with imperfect 2D silhouettes, we replace the ground-truth
2D masks with the ones obtained from Detectron2¹⁴ using a pretrained PointRend checkpoint,
mimicking how one could obtain the 2D segmentation masks in the real world. Since our training
images are rendered with a black background, we use two approaches to obtain the 2D silhouettes:
i) we directly feed the original training image into Detectron2 to obtain the predicted segmentation
mask (we refer to this as Mask-Black), and ii) we add a background image, randomly sampled
from the PASCAL-VOC 2012 dataset, before predicting the mask (we refer to this as Mask-Random). In this setting, the pretrained
Detectron2 model achieved 97.4 and 95.8 IoU for the Mask-Black and Mask-Random versions,
respectively. We again use the ShapeNet Car dataset [9] in this experiment.
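For reference, the IoU between a predicted and a ground-truth silhouette can be computed as in the sketch below. The masks are toy arrays rather than Detectron2 outputs, the function name is hypothetical, and reporting the value as a percentage is an assumption about how the numbers above are scaled.

```python
import numpy as np

def mask_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union of two boolean masks of identical shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)

# Toy example: a ground-truth square and a prediction shifted by one pixel.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True
print(f"IoU = {100 * mask_iou(pred, gt):.1f}")  # prints "IoU = 60.0"
```

In the toy case the intersection covers 12 pixels and the union 20, giving an IoU of 0.6.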


Experimental Results Quantitative results are summarized in Table C, with qualitative examples 
provided in Figures K and L. Although we observe a drop in the FID scores, qualitatively the results
are still similar to the original results in the main paper. Our model can generate high quality shapes 
even when trained with the imperfect masks. Note that, in this scenario, the training data for GET3D 
is different from the testing data that is used to compute the FID score, which could be one of the 
reasons for worse performance. 


C.6 Experiments on "Real" Images


Since many real-world datasets lack camera poses, we follow GANverse3D [73] and utilize a pretrained
2D StyleGAN to generate a realistic car dataset. We train GET3D on this dataset to demonstrate the
potential applications to real-world data. 


Experimental Setting Following GANverse3D [73], we manipulate the latent codes of a 2D StyleGAN
and generate multi-view car images. To obtain the 2D segmentation of each image, we use


¹⁴ https://github.com/facebookresearch/detectron2
Method      FID-Ori (↓)   FID-3D (↓)


EG3D [8]    13.77         60.42
GET3D       14.27         14.27


Table D: Additional quantitative comparison with EG3D [8] on Human Body dataset [2]. 


EG3D [8] Ours Ours-Tex 


Figure O: Additional qualitative comparison on Human Body dataset. We compare our method with 
EG3D [8] on the extracted geometry. 


DatasetGAN [74] to predict the 2D silhouette. We then use SfM [65] to obtain the camera initialization
for each generated image. We visualize some examples of this dataset in Figure N and refer the
reader to the original GANverse3D paper for more details. Note that, in this dataset, both the cameras
and the 2D silhouettes are imperfect.


Experimental Results We provide qualitative examples in Fig. M. Even when faced with the 
imperfect inputs during training, GET3D is still capable of generating reasonable 3D textured 
meshes, with variation in geometry and texture. 


C.7 Comparison with EG3D on Human Body 


Following the suggestion of the reviewer, we also train an EG3D model on the Human Body dataset
rendered from Renderpeople [2] and compare it to the results of GET3D.


Quantitative results are available in Table D and qualitative comparisons in Figure O. GET3D 
achieves comparable performance to EG3D [8] in terms of generated 2D images (FID-ori), while 
significantly outperforming it on 3D shape synthesis (FID-3D). This once more demonstrates the 
effectiveness of our model in learning actual 3D geometry and texture. 


D Material Generation for View-dependent Lighting Effects 


In modern computer graphics engines such as Blender [15] and Unreal Engine [32], surface properties 
are represented by material parameters crafted by graphics artists. To make the generated assets 
graphics-compatible, one direct extension of our method is to also generate surface material properties. 
In this section, we describe how GET3D is able to incorporate physics-based rendering models, 
predicting SVBRDF to represent view-dependent lighting effects such as specular surface reflections. 


As described in Sec. 4.3.1 of the main paper, two modules need to be adapted to facilitate material generation,
namely the texture generation and the rendering process. First, we repurpose the texture
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generator branch to predict the Disney BRDF properties [6, 32] on the surface as a reflectance field. 
Specifically, the texture generator now outputs a 5-channel reflectance property, including surface
base color $c_{base} \in \mathbb{R}^3$, roughness $\beta \in \mathbb{R}$ and metallic $m \in \mathbb{R}$ parameters.


Note that different from a texture field, rendering the reflectance field requires one additional shading 
step after rasterization into the G-buffer. Thus, the second adaptation is to replace the texture 
rasterization with an expressive rendering model capable of rendering the reflectance field. According 
to the non-emissive rendering equation [31], the outgoing radiance $L_o$ at the camera direction $\omega_o$ is
given by:


$$L_o(\omega_o) = \int_{S^2} L_i(\omega_i) \, f_r(\omega_i, \omega_o; c_{base}, \beta, m) \, (n \cdot \omega_i)^+ \, d\omega_i, \tag{12}$$


where $L_i$ is the incoming radiance, $f_r$ is the BRDF, $n$ is the normal direction at the surface point,
$n \cdot \omega_i$ is the cosine foreshortening term, $\omega_i$ is the incoming light direction sampled on the sphere $S^2$, while
$(n \cdot \omega_i)^+ = \max(n \cdot \omega_i, 0)$ constrains the integration to the positive hemisphere. Standard ray
tracing techniques adopt Monte Carlo sampling to estimate this integral, but this incurs a large
computation and memory cost. Inspired by [64, 72, 12], we instead employ a spherical Gaussian
(SG) rendering framework [12], which approximates every term in Eq. (12) with SGs and allows us 
to analytically compute the outgoing radiance without sampling any rays, from where we can obtain 
the RGB color for each pixel in the image. We refer the reader to [12] for more details. 
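To illustrate why spherical Gaussians avoid ray sampling, the sketch below evaluates a single SG lobe in its standard form, $G(v) = \mu \exp(\lambda (v \cdot \xi - 1))$, and compares its closed-form integral over the sphere, $\frac{2\pi\mu}{\lambda}(1 - e^{-2\lambda})$, with a brute-force quadrature. It is not the rendering framework of [12]; the function names and parameter values are chosen for illustration only.

```python
import numpy as np

def sg_eval(v, axis, sharpness, amplitude):
    """Evaluate a spherical Gaussian: amplitude * exp(sharpness * (v . axis - 1))."""
    return amplitude * np.exp(sharpness * (v @ axis - 1.0))

def sg_integral(sharpness, amplitude):
    """Closed-form integral of a spherical Gaussian over the unit sphere."""
    return 2.0 * np.pi * amplitude / sharpness * (1.0 - np.exp(-2.0 * sharpness))

def sg_integral_numeric(axis, sharpness, amplitude, n_theta=512, n_phi=1024):
    """Midpoint-rule quadrature over the sphere, used only as a sanity check."""
    theta = (np.arange(n_theta) + 0.5) * np.pi / n_theta
    phi = (np.arange(n_phi) + 0.5) * 2.0 * np.pi / n_phi
    t, p = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)], axis=-1)
    weights = np.sin(t) * (np.pi / n_theta) * (2.0 * np.pi / n_phi)
    return float(np.sum(sg_eval(dirs, axis, sharpness, amplitude) * weights))

axis = np.array([0.0, 0.0, 1.0])
analytic = sg_integral(sharpness=8.0, amplitude=1.5)
numeric = sg_integral_numeric(axis, sharpness=8.0, amplitude=1.5)
```

The analytic value requires a handful of arithmetic operations, whereas the quadrature evaluates the lobe at roughly half a million directions; closed-form SG operations of this kind are what allow the shading integral to be computed without sampling rays.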


Similar to the original training pipeline, we randomly sample light from a set of real-world outdoor 
HDR panoramas (detailed in the following “Datasets” paragraph) and render the generated 3D assets 
into 2D images using cameras sampled from the camera distribution of training set. We train the 
model using the same method as in the main paper by adopting the discriminators to encourage the 
perceptual realism of the rendered images under arbitrary real-world lighting, along with a second 
discriminator on the 2D silhouettes to learn the geometry. Note that no supervision from material 
ground truth is used during training, and the material decomposition emerges in a fully unsupervised 
manner. When equipped with a physics-based rendering model, GET3D successfully predicts
reasonable surface material parameters, generating delicate models which can be directly utilized in
standard rendering engines like Blender [15] and Unreal [32].


Datasets We collect a set of 724 outdoor HDR panoramas from HDRIHaven¹⁵, DoschDesign¹⁶
and HDRMaps¹⁷, which cover a diverse range of real-world lighting distributions for outdoor scenes.
We also apply random flipping and random rotation along the azimuth as data augmentation. During
training, we convert all the environment maps to SG lighting representations, where we adopt 32 SG
lobes, optimizing their directions, sharpness and amplitudes such that the approximated lighting is
close to the environment map. We optimize for 7000 iterations with an MSE loss and the Adam optimizer. The
converged SG lighting preserves most of the content of the environment map.
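The panorama augmentation can be sketched as below, under two assumptions that are not stated in the text: the environment maps are stored in equirectangular layout with shape (H, W, 3), and "flipping" means a horizontal mirror. In that layout a rotation about the vertical axis is simply a circular shift along the width. The function name is hypothetical.

```python
import numpy as np

def augment_panorama(pano: np.ndarray, rng) -> np.ndarray:
    """Randomly flip and rotate an equirectangular panorama of shape (H, W, 3).

    A rotation about the vertical axis (azimuth) is a circular shift along
    the width axis; a horizontal flip mirrors the same axis.
    """
    if rng.random() < 0.5:
        pano = pano[:, ::-1]
    shift = int(rng.integers(0, pano.shape[1]))
    return np.roll(pano, shift, axis=1)

rng = np.random.default_rng(0)
pano = rng.random((16, 32, 3))
augmented = augment_panorama(pano, rng)
```

Both operations only permute pixels within each row, so the total incoming light and its elevation profile are preserved while the azimuthal placement of light sources varies.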


As the ShapeNet dataset [9] does not contain consistent material definitions, we additionally collect 1494
cars from Turbosquid [4] with materials consistently defined with the Disney BRDF. To render the dataset
using Blender [15], we follow the camera configuration of the ShapeNet Car dataset, and randomly select
from the collected set of HDR panoramas as lighting. In the dataset, the ground-truth roughness for
car windows is in the range of [0.2, 0.4] and the metallic is set to 1; for car paint, the ground-truth
roughness is in the range of [0.3, 0.6] and the metallic is set to 0. We disable complex materials such
as the transparency and clear coat effects, such that the rendered results can be interpreted by the 
basic Disney BRDF properties including base color, metallic and roughness. 


Evaluation metrics Since we aim to generate 3D assets that can be used in graphics workflow to 
produce realistic 2D renderings, we quantitatively evaluate the realism of the 2D rendered images 
under real-world lighting using FID score. 


Comparisons To the best of our knowledge, to date no generative model can directly generate
complex geometry (meshes) with material information. We therefore only compare different versions
of our model. In particular, we compare the results to the texture prediction version of GET3D,


¹⁵ polyhaven.com/hdris (License: CC0)
¹⁶ doschdesign.com (License: doschdesign.com/information.php?p=2)
¹⁷ hdrmaps.com (License: Royalty-Free)
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where we do not use material and directly predict RGB color for the surface points. We then ablate 
the effects of using real-world HDR panoramas for lighting, which are typically hard to obtain. To 
this end, we manually use two spherical Gaussians for ambient lighting and a random directional light
to simulate the lighting when rendering the generated shapes during training, and try to learn the
materials under this simulated lighting.


Results The quantitative FID scores are provided in Table E. With material generation, the FID score
improves by more than 2 points when compared to the texture prediction baseline (18.53 vs 20.78).
This indicates that the material generation version of GET3D has better capacity and improved
realism compared to the texture-only baseline. When using the simulated lighting instead of real-world
HDR panoramas, the FID score degrades to 22.83, but the model still produces reasonable results.
We further provide additional qualitative results in Fig. P, visualizing rendering results of generated
assets under different real-world lighting conditions. We import our generated assets in Blender and
show animated visualizations in the accompanying video (demo.mp4).


Method                                             FID

Ours (Texture)                                     20.78
Ours + Material (Ambient and directional light)    22.83
Ours + Material (Real-world light)                 18.53


Table E: Quantitative FID results of material generation. 


E Text-Guided 3D Synthesis


Technical details. As briefly described in Sec. 4.3.2, our text-guided 3D synthesis method follows
the dual-generator design from StyleGAN-NADA [21], and uses the directional CLIP loss [21]. In
particular, at each optimization iteration, we randomly sample $N = 16$ camera views and render $N$
paired images using two generators: the frozen one ($G_f$) and the trainable one ($G_t$). The directional
CLIP loss can then be computed as:


$$L_{\text{clip}} = 1 - \sum_{i=1}^{N} \frac{\Delta I_i \cdot \Delta T}{\|\Delta I_i\| \, \|\Delta T\|}, \tag{13}$$


where $\Delta I_i = E(R(G_t(w), c_i)) - E(R(G_f(w), c_i))$ is the translation of the CLIP embeddings ($E$)
from the rendering with $G_f$ to the rendering with $G_t$ under camera $c_i$, and $\Delta T$ is the CLIP embedding
translation from the class text label to the provided query text. In our implementation, we used two
pre-trained CLIP models with different Vision Transformers ('ViT-B/32' and 'ViT-B/16') [18] for
different levels of detail, and follow the text augmentation as in the StyleGAN-NADA codebase¹⁸.
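The loss can be sketched numerically as follows, reading it as one minus the cosine similarity between each image-space translation and the text-space translation, summed over the $N$ views. This reading of the normalization is an assumption, as are the function name and the random vectors, which merely stand in for CLIP embeddings of the paired renderings and of the two text prompts; no CLIP model is invoked.

```python
import numpy as np

def directional_clip_loss(emb_trainable, emb_frozen, emb_query_text, emb_class_text):
    """One minus the summed cosine similarity between image and text directions.

    emb_trainable, emb_frozen: (N, D) embeddings of the N paired renderings.
    emb_query_text, emb_class_text: (D,) text embeddings.
    """
    delta_i = emb_trainable - emb_frozen        # (N, D) image-space translations
    delta_t = emb_query_text - emb_class_text   # (D,) text-space translation
    cos = (delta_i @ delta_t) / (
        np.linalg.norm(delta_i, axis=1) * np.linalg.norm(delta_t)
    )
    return float(1.0 - cos.sum())

# Stand-in embeddings (assumption): N = 16 views, D = 512 dimensions.
rng = np.random.default_rng(0)
frozen = rng.normal(size=(16, 512))
class_text = rng.normal(size=512)
query_text = rng.normal(size=512)

# If every rendering moves exactly along the text direction, each cosine is 1.
aligned = frozen + (query_text - class_text)
loss_aligned = directional_clip_loss(aligned, frozen, query_text, class_text)
loss_opposed = directional_clip_loss(frozen - (query_text - class_text), frozen,
                                     query_text, class_text)
```

Minimizing the loss therefore pushes the change between the frozen and trainable renderings, as seen by CLIP, to point in the same direction as the change from the class label to the query text in every sampled view.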


¹⁸ https://github.com/rinongal/StyleGAN-nada (MIT License)
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Figure P: Material generation and relighting. We visualize seven generated cars’ material properties 
and relight with four different lighting conditions. 
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Figure Q: Generated 3D Geometry. Additional qualitative comparison with baseline methods on generated 
3D geometry 
Figure R: Generated Image. Additional qualitative comparison with baseline methods on generated 2D
images.
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Figure S: Qualitative results on ShapeNet cars. 
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Figure T: Qualitative results on ShapeNet chairs. 
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Figure V: Qualitative results on Turbosquid animals. 
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Figure W: Qualitative results on ShapeNet motorbikes. 
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Figure X: Qualitative results on Renderpeople. 
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w/o volume subdivision w/ volume subdivision 


Figure Y: We compare results with and without applying volume subdivision on ShapeNet motorbikes. 
With volume subdivision, our model can generate finer geometric details, such as handles and steel wires.
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